
Analytical approaches to address proteome complexity

Written by Jens Coorssen (Brock University)

Following on from his exposé of the dearth of specificity in our vocabulary surrounding the proteome and his emphasis that individual proteoforms should be the focus of our proteomic investigations, Jens Coorssen (Brock University, St. Catharines, Canada) is back. Here, he takes a look at current analytical techniques and reviews their compatibility with truly deep, complex proteomic analysis.


Arthur Schopenhauer once said that “all truth passes through three stages. First, it is ridiculed. Second, it is violently opposed. Third, it is accepted as being self-evident.” Effective analytical strategy – to identify rational drug targets, biomarkers, and environmental indicators – always lies in the long game rather than low-hanging fruit. While seemingly an inconvenient truth, it is abundantly clear that proteomes are not simply canonical amino acid sequence (i.e., ‘protein’) representations of the genome. Complexity prevails, with myriad protein species – proteoforms – defining proteomes and serving as the functional entities driving biology. Here, we build on a previous introduction to proteome complexity and the critical need to resolve and identify proteoforms to understand their inherent impact on systems.

Complexity and analytical strategies

As the fundamental constituents of proteomes, proteoforms must be the target analytes of any proteome analysis genuinely considered as ‘(ultra)deep’ or ‘comprehensive’. Anything less is proteogenomics, which, via the transcriptome, only links back to identified genes. However, posttranslational modifications (PTMs), adducts, and a host of other variations at the systems level of proteoform functions, cannot be predicted from nucleic acid sequences. Further non-canonical species are also being identified as genome complexity is more thoroughly dissected (e.g., microproteins). Extensive, direct and routine analysis of proteoforms is thus essential to fully and effectively address the complexity that we now appreciate as defining proteomes.

In light of what might questionably be argued as differing definitions of a proteome, there are currently at least two major approaches to proteome analysis: bottom-up and top-down [1,2]. Although original methods had already addressed the issue of resolving proteoforms, bottom-up/‘shotgun’ proteogenomics came to the fore ~20 years ago, piggybacking on the initial completion of the human genome. This purely proteogenomic approach consists of en masse digestion of whole proteome extracts (usually with trypsin), analysis of the resulting peptides by liquid chromatography (LC) and tandem mass spectrometry (MS/MS or TMS), and inference of canonical amino acid sequence (i.e., ‘protein’) identifications by matching the resulting peptide data, where possible, to sequences in databases. Because the process is largely software-driven, inappropriate identifications result if spectral quality is not critically monitored [3]. Nonetheless, large amounts of data often remain unassigned.
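The digestion step at the heart of the shotgun workflow can be illustrated with a minimal in-silico sketch. This assumes the common trypsin rule (cleavage C-terminal to Lys/Arg, except before Pro); the function name, parameters and test sequence are illustrative, not from any particular tool:

```python
import re

def tryptic_digest(sequence, missed_cleavages=0, min_len=6):
    """In-silico trypsin digest: cleave C-terminal to K/R, but not before P."""
    # Zero-width split points keep the K/R on the preceding fragment.
    fragments = [f for f in re.split(r'(?<=[KR])(?!P)', sequence) if f]
    peptides = set()
    for i in range(len(fragments)):
        # Allow up to `missed_cleavages` uncut sites per peptide.
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptide = ''.join(fragments[i:j + 1])
            if len(peptide) >= min_len:
                peptides.add(peptide)
    return peptides

# Toy example (hypothetical sequence): two fully tryptic peptides result;
# the internal R is not a cleavage site because it is followed by P.
print(sorted(tryptic_digest("AAAKGGGRPCCK", min_len=1)))  # → ['AAAK', 'GGGRPCCK']
```

Even this toy digest illustrates the core problem the article raises: every canonical sequence collapses into a pool of short peptides from which proteoform-level information cannot be reconstructed.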

Bottom-up has critical limitations

Although claimed to be fast, this supposed benefit becomes questionable if analytical replicates are appropriately carried out for every technical replicate in an experiment, as is essential given the relatively poor run-to-run reproducibility of the purely LC/TMS approach. Not doing so often results in ‘identifications’ based on only 1–2 peptides that do not approximate coverage of the whole canonical sequence, raising questions of size variants and whether the intact ‘protein’ is actually present. Assumptions prevail.

Somewhat more stringent variations first resolve proteome extracts by one-dimensional (1D) SDS-PAGE, thereby identifying size variants. Nonetheless, any band on a 1D gel represents myriad proteoforms due to the low overall resolution of the technique. This further characterizes the fatal shortcoming of the purely bottom-up analytical approach – a failure to broadly identify proteoforms and truly address proteome complexity at a level necessary to effectively understand systems biology.

More thorough peptide analysis provides identification of PTMs (e.g., phosphorylation and glycosylation) and claims to thereby identify proteoform ‘groups’. Issues associated with this approach include low proteoform abundance (and thus of peptides with modifications) and complicated relationships between amino acid sequence abundance and proteoform abundance [1,2]. Beyond these challenges, as critical data are lost during the gross proteolytic digestion that forms the basis of the shotgun method, ‘groups’ are essentially another peptide cataloging exercise, since specific proteoforms are not identified. In effect, this is like saying that one may have seen some ‘Ford’ cars (possibly of a certain model), some of which might be a certain shade of red. This is clearly meaningless in terms of distinct identification of critical proteoforms or quantitative analysis thereof. Nonetheless, such technical developments in mass spectrometry and data analysis are strong complements and enablers of more thorough integrative approaches to genuine deep proteome analysis (see below).

Further issues arise as quantitative PTM assessments require separate peptide isolation methods and analyses for each type of PTM; mainly being affinity based, these methods come with inherent issues of selectivity, specificity and quantitative thoroughness. Furthermore, with >400 currently known PTMs, such routine analyses with appropriate technical and experimental replicates would be impossible even with sufficient samples. Recognizing that each MS analysis must be a separate, sequential run, claims of ‘fast’ and ‘deep’ analyses become moot if not utterly untenable.
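The scale of this problem is easy to make concrete with back-of-envelope arithmetic. Only the >400 known PTMs figure comes from the text above; every replicate number and the per-run time below are illustrative assumptions:

```python
# Back-of-envelope run count for PTM-by-PTM enrichment analyses.
# Only n_ptms reflects the article; all other values are assumptions.
n_ptms = 400           # lower bound on currently known PTM types
experimental_reps = 3  # assumed biological/experimental replicates
technical_reps = 3     # assumed technical replicates per sample
analytical_reps = 2    # assumed repeat injections per technical replicate

total_runs = n_ptms * experimental_reps * technical_reps * analytical_reps
print(total_runs)       # → 7200 sequential LC/TMS runs for one condition

hours_per_run = 1.0     # assumed; many gradients run longer
print(total_runs * hours_per_run / 24)  # → 300.0 days of continuous instrument time
```

Even with generous assumptions, one condition alone consumes the better part of a year of instrument time, which is why claims of ‘fast’ comprehensive PTM coverage do not hold up.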

Currently, bottom-up proteomics seeks to establish a niche for itself in the developing areas of spatial and single-cell proteomics; in line with fitness-for-purpose [2], proteogenomics is the only fall-back in these cases. Nonetheless, each area could deliver genuinely deep proteomic assessments, but this requires specific method/tool development and the willingness to undertake such critical analyses. Thus, the bottom line on bottom-up is that it can provide a general proteogenomic scan of the canonical ‘proteins’ that may be present in a sample.

Fundamentally critical information is irretrievably lost when the inherent intact proteoform complexity of native proteomes is initially destroyed, creating an even more complex (and confusing) mass of peptides, which mass spectrometers handle better than intact proteoforms. Yet, how many critical species would already be identified had a more rigorous methodology prevailed? Any identified canonical ‘protein’, no matter how thorough the sequence coverage achieved, likely represents many proteoforms that (transiently) exist at the time of sampling. While the temporal (and spatial) issues are more difficult to address, integrative analyses – resolving intact proteoforms prior to their identification – are the only pathway to realistically ‘deep’ proteome analysis.

If not bottom up then…

Top-down proteomics is well described as the original approach to proteomic analysis [1]. As indicated by the adopted analytical chemistry term, ‘top-down’ involves first resolving intact proteoforms prior to their analysis and identification. This then directly addresses the complexity of native proteomes and the needs of systems biology. There are two main approaches to top-down proteomics: MS-intensive and integrative, respectively [1]. Although the likely complementarity of these approaches has been proposed [1], it has not been systematically explored, likely due to difficulties in quantitatively recovering intact proteoforms from polyacrylamide gels. Among other issues, methods for dissolving gels likely alter proteoforms [1], and results from limited tests have been disappointing, with low recoveries.

MS-intensive analyses seek to assess proteomes by analyzing intact proteoforms directly in the mass spectrometer [1,2]. In reality, 1D gel-based separations of proteome extracts are first required, minimally to isolate low from high molecular weight (MW) proteoforms. This is mandatory as current methods, including the highest resolution Fourier transform ion cyclotron resonance mass spectrometers and multiple high energy dissociation approaches (e.g., collision-induced, electron-capture, electron-transfer), cannot routinely assess proteoforms larger than 20–30 kDa [1]. Mild digestion methods yielding larger peptides have also been used, but this still does not broadly address the need to analyze complex native proteomes. Thus, while the approach is quite thorough in identifying lower MW proteoforms, there are few examples of larger species being identified.

Other issues include relatively low throughput, co-ionization/spectrum complexity, poor coverage of low abundance species, as well as sampling/sample handling issues that are general to proteomics (e.g., proteoform solubility); there are efforts to address these limitations. Thus, while powerful and promising, MS-intensive top-down analyses are currently limited. It is impossible to predict when whole proteomes might be effectively analyzed and therefore when a fully operational method might be widely available. Currently, the approach provides targeted, cutting-edge strength in analyzing highly purified samples, some molecular complexes for example. The complicated proteoform patterns in such isolated samples further reinforce the inherent complexity of native proteomes that must be addressed.

Integrative top-down analysis

In contrast to other approaches, integrative top-down proteomic analyses capitalize and expand on proven analytical approaches [1,4], integrating front-end proteoform isolation by 2D gel electrophoresis (2DE; with decades of established quantitative optimizations) with cutting-edge LC/TMS. Most 2DE protocols resolve intact proteoforms first by net charge (isoelectric point; pI) via isoelectric focusing in an immobilized pH gradient (IPG), and then according to size (nominal MW) by SDS-PAGE. Other 2D gel variations are also well established, such as Blue- or Clear-native/SDS-PAGE and 16-BAC/SDS-PAGE.

Refinements to 2DE include the use of gradient gels, third separations (3DE) to enhance the resolution of co-migrating species, PTM-selective and high sensitivity total proteome stains, and deep imaging to visualize even low abundance proteoforms. Recently, enhanced sample reduction and micro-perforating IPG strips have further optimized analyses and increased throughput [1,4]. Furthermore, multiple, replicate gels are simultaneously resolved rather than the time- and resource-intensive sequential replications required in shotgun and MS-intensive analyses.

Advanced image analysis programs enable the identification of changes of interest – proteoform ‘spots’ altered in abundance or by other criteria. Each spot becomes a focused bottom-up study; databases are used to identify corresponding sequences and the pI, MW and selective staining information from 2DE is used to verify identification of a proteoform by variance from canonical pI and/or MW. What’s more, peptides can also be more deeply analyzed for PTMs.
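The pI-based verification step can be sketched computationally. The following minimal estimate (using the EMBOSS default pKa set, one of several published sets, so values are approximate; function names are illustrative) bisects for the pH at which the Henderson–Hasselbalch net charge of a sequence is zero:

```python
# Theoretical pI from sequence, using the EMBOSS default pKa set
# (one of several published sets; all such estimates are approximate).
PKA_POS = {'Nterm': 8.6, 'K': 10.8, 'R': 12.5, 'H': 6.5}
PKA_NEG = {'Cterm': 3.6, 'D': 3.9, 'E': 4.1, 'C': 8.5, 'Y': 10.1}

def net_charge(sequence, ph):
    """Henderson-Hasselbalch net charge of a peptide/protein at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS['Nterm']))   # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG['Cterm'] - ph))  # free C-terminus
    for aa in 'KRH':    # basic side chains
        charge += sequence.count(aa) / (1.0 + 10 ** (ph - PKA_POS[aa]))
    for aa in 'DECY':   # acidic side chains
        charge -= sequence.count(aa) / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def theoretical_pi(sequence, tol=1e-4):
    """Bisect for the pH at which the net charge crosses zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return round((lo + hi) / 2.0, 2)

# With no ionizable side chains, pI is the mean of the terminal pKa values.
print(theoretical_pi("GGAGG"))  # → 6.1
```

A proteoform whose observed 2DE spot sits clearly more acidic than this canonical estimate is, for example, consistent with phosphorylation, since each added phosphate lowers the pI; this is the kind of variance from canonical pI and/or MW used to verify an identification.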

As chromatography, including electro-chromatography using gels, is in part limited by the available resolving ‘space’ (i.e., theoretical plates), it is not surprising that narrow pI gradient ‘zoom’ gels, 3DE and high-resolution MS have established that there are often multiple proteoforms in a given spot resolved by 2DE. While these apparently co-migrating proteoforms are likely resolved from one another, the separate species are not discernible as distinct spots at the macro level of a stained gel and instead often appear as a single stained spot (analogous to overlapping LC peaks). Thus, although detection sensitivity has progressively increased with new stains and detection methods, discerning separate ‘micro’ spots from closely resolved neighboring proteoforms is not possible with current gel imaging instrumentation. Alternatively, a whole 2D gel can be finely sectioned, and thus essentially entire proteomes can be analyzed.
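The resolving-‘space’ argument can be quantified with Giddings’ multiplicative rule: for (near-)orthogonal separations, total peak capacity is approximately the product of the component capacities. The component values below are illustrative assumptions only, not measured figures:

```python
# Giddings' rule: for (near-)orthogonal dimensions, total peak capacity
# is approximately the product of the individual peak capacities.
# All component capacities below are illustrative assumptions.
ief_capacity = 100        # assumed IPG/IEF (pI) dimension
sds_page_capacity = 100   # assumed SDS-PAGE (MW) dimension
third_dim_capacity = 20   # assumed modest third separation (3DE)

capacity_2d = ief_capacity * sds_page_capacity
capacity_3d = capacity_2d * third_dim_capacity
print(capacity_2d)  # → 10000 resolvable 2DE 'slots'
print(capacity_3d)  # → 200000 with a third dimension
```

Coupling each resolved spot to TMS multiplies the effective capacity yet again, which is why an apparent single spot so often houses multiple co-migrating proteoforms that can still be teased apart downstream.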

Nonetheless, the capacity to further resolve and identify the individual proteoforms within visualized spots using 3DE and/or TMS, or to section and analyze whole gels, further reinforces the critical and unprecedented power of integrative top-down proteomics. This approach is estimated to be quite capable of resolving ≥1 million proteoforms from complex native proteomes, with reasonably thorough sequence coverage and high confidence in proteoform identifications across resolved pI and MW ranges. This elegantly coupled approach thus constitutes state-of-the-art proteomic analysis, capitalizing on the strengths and minimizing the limitations of both 2DE and LC/TMS.

Alternative ‘new’ analysis

Proposed and/or developing proteomic analyses attract considerable attention and investment. Critically considered, these largely reimagine bottom-up methods, yielding the same information using alternate approaches to isolate and identify ‘proteins’. Although having the sheen of newness, these methods still do not enable deep proteoform/proteome analysis.

Currently, adaptations of Edman degradation, (sub)nanopores, dendrimers, aptamers, affinity matrices, BioID (and associated ‘proximity’ methods), CITE-seq and other RNA- and antibody-based techniques only identify canonical proteins. Not to detract from the potential of some approaches, but each has its own inherent technical limitations, and these are further complicated when proteoforms are, correctly, the intended analytes for critical quantitative analyses.

Should identified ‘proteins’ prove important, then these must still be further pursued to the proteoform level by yet other methods. Regrettably, such identifications are rarely if ever rigorously pursued. Simply cataloging results does not constitute translation into knowledge nor working solutions that stand the test of time.

Critical considerations and directions

It is essential to realize that there is no panacea for proteome analysis that currently guarantees every copy of every type of proteoform present in a sample is quantitatively extracted or survives unaltered through the many sample handling and processing steps found throughout the literature. Nor can we comprehensively ensure that the resulting peptides from any of the proteoforms are handled equally well during LC or quantitatively ionized during TMS. For example, phosphopeptides are generally of lower abundance than unmodified peptides, ionize poorly, and carry a labile phosphate group, all resulting in fewer spectra. Enrichment to concentrate the analyte of interest is the favoured approach to address these issues, but these methods also have inherent limitations. 2DE addresses this, at least in part, as it isolates proteoforms prior to analysis. Additionally, new software tools are helping to address labile modifications.

Further complexity arises with non-canonical phosphorylations. In addition to the classical phosphorylation of serine, threonine, and tyrosine, it is now known that arginine, aspartate, cysteine, glutamate, histidine, and lysine can also be modified, tuning proteoform functions, although this remains unaddressed in most reported phospho-protein/-peptide analyses.

The objective must be to continuously strive for the best, deepest possible analyses. This can only be achieved by systematically and quantitatively refining, optimizing and evolving methods rather than simply repeating what others have done and/or commercialized, with all the inherent shortcomings. Rather than the status quo approach that seems most prevalent, driving forward with constant critical optimization and innovation will yield a genuine systems-level understanding of molecular mechanisms and the identification of highly selective biomarkers and drugs.

“People who say it cannot be done, should not interrupt those who are doing it.”

Attributed to George Bernard Shaw


About the author

Jens Coorssen is a Professor in the Faculty of Applied Health Sciences at Brock University (St. Catharines, Canada), a Research Scholar at the Ronin Institute (NJ, USA) and a member of the Institute for Globally Distributed Open Research and Education (IGDORE), an independent research institute dedicated to improving the quality of science, science education, and quality of life for scientists, students and their families.