The Zhao Bioinformatics Laboratory
AtSubP: the Arabidopsis Subcellular Localization Prediction Server  
INTRODUCTION
 
This tool was developed as part of a broader study investigating the advantages of "organism-specific" predictors over "general" ones for predicting protein subcellular localization. To demonstrate this, we performed a systematic study in Arabidopsis thaliana and created an integrative Support Vector Machine-based localization predictor called 'AtSubP', which combines diverse protein features: amino acid composition, sequence-order effects, terminal information, PSSM and similarity search-based PSI-BLAST information. When used to predict seven compartments (Chloroplast, Cytoplasm, Golgi apparatus, Mitochondrion, Extracellular, Nucleus, Plasma membrane) in a 5-fold cross-validation test, our best hybrid classifier achieves an overall accuracy of 90.95%, with precision and Matthews correlation coefficient (MCC) values of 90.86% and 0.89, respectively. Benchmarking AtSubP on two independent datasets, one from Swiss-Prot and another containing GFP- and mass spectrometry-determined proteins from the SUBA and eSLDB databases, shows a significant improvement in the prediction accuracy of the 'species-specific' AtSubP over widely used 'general' tools such as TargetP, LOCtree, PA-SUB, MultiLoc, WoLF PSORT and Plant-PLoc, as well as over our newly created 'All-Plant' method. As a further rigorous test, applying the Arabidopsis-specific classifier to six eukaryotic organisms it was not trained on (Rice, Soybean, Human, Yeast, Fruit fly, Worm) yields markedly inferior predictions.
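For readers unfamiliar with the reported measures, the following is a minimal sketch of how accuracy, precision and MCC can be computed for a single compartment treated as a one-vs-rest problem. The function name `binary_metrics` and the example counts are hypothetical; how the per-compartment values were aggregated into the overall figures quoted above is not described on this page.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, mcc) from one-vs-rest confusion counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives for one compartment against all others.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, precision, mcc

# Illustrative counts only, not taken from the study.
acc, prec, mcc = binary_metrics(tp=90, fp=10, tn=90, fn=10)
print(f"accuracy={acc:.2f} precision={prec:.2f} mcc={mcc:.2f}")
```

MCC ranges from -1 to 1 and accounts for all four confusion counts, which makes it less sensitive to class imbalance than accuracy alone.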

Thus, this comprehensive case study strongly indicates the presence of 'species-specific' sorting patterns or signals in individual organisms that are missed during training when 'general' prediction methods are developed. We suggest actively developing similar genome-specific systems for other organisms to speed up the annotation of their individual proteomes, rather than relying on general prediction tools. Five diverse prediction modules based on various features of a protein sequence have been implemented as a dynamic web server, 'AtSubP', giving users a wider choice of features to extract from their query protein sequences: simple amino acid composition, sequence-order-based dipeptide composition, terminal-based information, Position Specific Scoring Matrix (PSSM) and similarity-based PSI-BLAST, in addition to our best-performing hybrid classifier.
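To illustrate the two simplest feature types named above, here is a minimal sketch of amino acid composition (a 20-value vector) and dipeptide composition (a 400-value vector). The function names and the handling of non-standard residues are assumptions made for illustration; this is not the server's actual feature-extraction code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs

def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values).

    Non-standard characters (e.g. X) are ignored; this is an assumption.
    """
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / total for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in seq (400 values).

    Pairs containing a non-standard character are skipped.
    """
    seq = seq.upper()
    counts = {}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair[0] in AMINO_ACIDS and pair[1] in AMINO_ACIDS:
            counts[pair] = counts.get(pair, 0) + 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts.get(d, 0) / total for d in DIPEPTIDES]

# Toy sequence for demonstration only.
aac = amino_acid_composition("AACD")
dpc = dipeptide_composition("AACD")
print(len(aac), len(dpc))
```

Unlike plain amino acid composition, the dipeptide vector captures local sequence order, because it counts which residue follows which. A terminal-based variant could apply the same functions to only the first or last residues of a sequence, with the window length as a further assumption.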

Currently, the TAIR community actively uses TargetP to annotate the complete Arabidopsis proteome (ftp://ftp.arabidopsis.org/home/tair/Proteins/Properties/TargetP_analysis.tair9). However, in our independent comparison on various test datasets, 'AtSubP' shows better accuracy and wider location coverage than 'TargetP'. We therefore believe that 'AtSubP' can serve as a better complement for accurately annotating the Arabidopsis thaliana proteome. The complete list of subcellular predictions generated by 'AtSubP' is available under the 'Datasets' section.


If you are using this server, please cite:
Kaundal, R., Saini, R. and Zhao, P.X. 2010. Combining Machine Learning and Homology-based Approaches to Accurately Predict Subcellular Localization in Arabidopsis. Plant Physiology 154(1): 36-54.
  © 2009 by The Samuel Roberts NOBLE Foundation, Inc.