Promoter software prediction

We estimated the performance of the predictive models using sensitivity (Sn), specificity (Sp), precision, i.e. the positive predictive value (P1), accuracy, a measure of statistical bias (Ac), negative predictive value (P2), the F1-score, i.e. the harmonic mean of precision and sensitivity (F1), and the Matthews correlation coefficient (MCC). These statistical measures are briefly described in the Supplementary material.
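As a rough illustration, the seven measures can be computed from confusion-matrix counts as sketched below. The formulas are the standard textbook definitions, not taken from the Supplementary material, and the function name `classification_metrics` is hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, P1, P2, Ac, F1 and MCC from confusion-matrix counts.

    Standard textbook definitions are assumed here; the Supplementary
    material may use a slightly different notation.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)                   # sensitivity (recall)
    sp = ratio(tn, tn + fp)                   # specificity
    p1 = ratio(tp, tp + fp)                   # precision (positive predictive value)
    p2 = ratio(tn, tn + fn)                   # negative predictive value
    ac = ratio(tp + tn, tp + tn + fp + fn)    # accuracy
    f1 = ratio(2 * p1 * sn, p1 + sn)          # harmonic mean of precision and sensitivity
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)   # Matthews correlation coefficient
    return {"Sn": sn, "Sp": sp, "P1": p1, "P2": p2, "Ac": ac, "F1": f1, "MCC": mcc}
```

For example, `classification_metrics(90, 80, 20, 10)` describes a test with 100 promoters and 100 non-promoters in which 90 promoters and 80 non-promoters are classified correctly.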

The algorithm of bTSSfinder is depicted in Figure 1. More details about the algorithm are given in the Results section (Section 3). The algorithm is implemented in the bTSSfinder tool.

Flow-chart of the algorithm implemented in the bTSSfinder program. T is the threshold for the prediction of the box specific for every sigma class.

The largest collection of experimentally validated promoters of E. Unfortunately, no such classification exists for cyanobacterial promoters. So far, the and boxes have been identified or predicted in a handful of promoters. Our preliminary comparison of E. Combining E.

We identified over 30 prospective features that may confer specificity for the different promoter classes. To cull the feature space to those with the highest predictive power, we calculated Mahalanobis distances for each feature and reduced the number to 19 to 21 features, depending on the promoter class (Supplementary Table S3).
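The feature-culling step can be sketched as follows. This is an assumption about how such a ranking might be done, not the procedure used in bTSSfinder: for a single feature, the Mahalanobis distance between the promoter and non-promoter class means reduces to the absolute mean difference divided by the pooled standard deviation. The function names and the toy feature names are hypothetical.

```python
import math
from statistics import mean, variance


def feature_distance(pos_values, neg_values):
    """One-dimensional Mahalanobis distance between two class means.

    Uses the pooled sample variance of the two classes (an assumption).
    """
    n_p, n_n = len(pos_values), len(neg_values)
    pooled = ((n_p - 1) * variance(pos_values)
              + (n_n - 1) * variance(neg_values)) / (n_p + n_n - 2)
    return abs(mean(pos_values) - mean(neg_values)) / math.sqrt(pooled)


def rank_features(pos_rows, neg_rows, names):
    """Rank features by how well they separate positives from negatives."""
    distances = {}
    for j, name in enumerate(names):
        pos_col = [row[j] for row in pos_rows]
        neg_col = [row[j] for row in neg_rows]
        distances[name] = feature_distance(pos_col, neg_col)
    # Largest distance first: these features discriminate best.
    return sorted(distances.items(), key=lambda item: item[1], reverse=True)


# Toy data: two features measured on three promoters and three non-promoters.
pos = [[1.0, 5.0], [2.0, 5.5], [3.0, 4.5]]
neg = [[1.5, 0.0], [2.5, 0.5], [2.0, -0.5]]
ranking = rank_features(pos, neg, ["free_energy", "motif_score"])
```

Keeping only the top-ranked entries of `ranking` would then correspond to reducing the feature set for one promoter class.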

To the best of our knowledge, this is the first time a wide feature base was used for this type of problem. Physico-chemical properties of the promoter sequences: four features were chosen in the feature selection process: free energy, base stacking, entropy and melting temperature. Using a combination of features for each promoter class, as outlined in Supplementary Table S3, we built 10 NN classifiers, one for each promoter class in E.

Then, we implemented these models in the bTSSfinder program. For each window, the position is classified as TSS or non-TSS using the appropriate NN classifier, based on a threshold that was predetermined during training. Predictions that pass the qualifying threshold are labeled as putative TSSs. Depending on user preference, bTSSfinder can report for a chosen phylum: (i) all predicted TSSs for all promoter classes, (ii) the predicted TSSs for a user-selected promoter class or (iii) the highest scoring TSS.
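The window-by-window scan can be illustrated with a minimal sketch. Everything here is hypothetical: the window size, the threshold, and in particular `toy_at_score`, which merely stands in for a trained NN classifier and has nothing to do with the real bTSSfinder models.

```python
def scan_for_tss(sequence, score_window, window=251, threshold=0.5):
    """Slide a window along the sequence and keep positions that pass the threshold.

    `score_window` is any callable returning a score for a window; it stands
    in for a trained classifier. Reported positions are window centres.
    """
    hits = []
    half = window // 2
    for start in range(len(sequence) - window + 1):
        score = score_window(sequence[start:start + window])
        if score >= threshold:
            hits.append((start + half, score))
    return hits


def best_hit(hits):
    """Return the highest scoring putative TSS, or None if nothing passed."""
    return max(hits, key=lambda hit: hit[1]) if hits else None


def toy_at_score(window_seq):
    """Toy scorer (AT fraction), used only to make the sketch runnable."""
    return sum(base in "AT" for base in window_seq) / len(window_seq)


hits = scan_for_tss("GGGGATATGGGG", toy_at_score, window=4, threshold=1.0)
```

Returning `hits` corresponds to reporting all putative TSSs for a class, while `best_hit(hits)` corresponds to reporting only the highest scoring one.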

We tested bTSSfinder on positive and negative sets for every promoter class in E. We observed good performance for all promoter classes in E. The F1-score for the remaining cyanobacterial promoter classes ranged from 0. Table 1 Testing results for five sigma classes of E. Test experiments for every sigma class were repeated 10 times for randomly selected negative sets and the means were taken. For fairness, we assessed all tools on a single testing dataset.

All other promoter prediction tools that we checked were no longer available. We also tried to test BacPP, which is the first tool that attempted to predict the complete range of sigma promoters in E. The authors reported high prediction accuracy for BacPP, but these results were obtained from small training and testing sets. Given that we would have to make an educated guess as to where the TSS locations are, as well as the sheer number of promoters it predicts, we excluded this tool from our comparison.

Our comparison clearly indicates that bTSSfinder has significantly higher prediction accuracy. Table 2 Comparison of available promoter prediction programs tested on E. A prediction is counted as true if the distance between the annotated and predicted TSSs is 50 bp or less. Using short sequences to predict TSSs is not sufficient for evaluating the accuracy and efficiency (especially the real false positive rate) of a prediction tool.

It should also be tested on longer sequences. In fact, an ideal test should be genome-wide. Nonetheless, genome-wide TSS maps are scarce, which makes assessing such predictions infeasible.
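The 50 bp criterion used to score predictions against annotated TSSs can be sketched as below. How multiple predictions near one annotated TSS are counted is an assumption of this sketch (each annotated TSS is counted once), and the function name `match_tss` is hypothetical.

```python
def match_tss(annotated, predicted, max_distance=50):
    """Score predicted TSS positions against annotated ones.

    An annotated TSS counts as recovered (TP) if any prediction lies within
    `max_distance` bp; otherwise it is a FN. A prediction that is not within
    `max_distance` bp of any annotated TSS counts as a FP.
    """
    recovered = [a for a in annotated
                 if any(abs(a - p) <= max_distance for p in predicted)]
    spurious = [p for p in predicted
                if all(abs(a - p) > max_distance for a in annotated)]
    tp = len(recovered)
    fn = len(annotated) - tp
    fp = len(spurious)
    return tp, fp, fn


# Toy coordinates: three annotated TSSs and four predictions.
counts = match_tss([100, 500, 900], [120, 151, 700, 860])
```

On a long sequence the `spurious` list grows with every extra prediction, which is why the false positive rate is best judged on long or genome-wide inputs.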

Our results highlight the scale of the problem that researchers encounter when they analyze long sequences. We also investigated whether models optimized for E. Results of these cross-phylum experiments are presented in Table 4. However, the opposite scenario had a significant impact on sensitivity (Table 4). Cross assessment of the models for the other sigma factors failed to reproduce the sensitivity achieved for their intended species. This can perhaps be explained by: (i) significant structural differences between promoters in E.

In fact, we tested bTSSfinder and the other three tools on ten other bacterial species belonging to five different phyla: three Firmicutes, four Proteobacteria, one Spirochetes, one Chlamydias and one CFB group. For details of this comparison, consult the supplementary material (Supplementary Table S4). Table 4 Result of cross-phylum application of bTSSfinder on the positive dataset. Sensitivity values obtained with the species-specific bTSSfinder parameters are given in bold.

We observed that some experimentally verified promoters did not pass the prediction thresholds. This may warrant an alternative approach to search for transcription start regions (TSRs) rather than points.

The scoring landscape of experimentally validated TSSs in E. The promoter prediction problem in prokaryotes is an old problem that has yet to achieve an adequate solution. Available tools tend to produce many false positives or have poor sensitivity, especially when applied to long sequences or whole genomes.

These limitations are probably due to the following challenges. Some promoters that are strong in vitro and are predicted computationally with a high score are in fact not used in vivo at all, perhaps due to unknown repression mechanisms (Hertz and Stormo; Huerta and Collado-Vides). Some predicted TSSs may be evaluated as false positives due to the lack of experimentally verified, comprehensive and precise TSS maps.

Scarcity of experimental data also means that training models using features extracted from the limited available data would naturally restrict their predictive power. All methods, as far as we know, depend on promoter architecture and other physico-chemical properties in their model building.

The choice of a negative dataset can be detrimental for the trained model since one cannot be certain about the total absence of TSSs in the negative dataset.

Nonetheless, promoter prediction, especially at the whole genome level, remains unresolved and this warrants further investigations in this field. The authors thank Mohamad Jaber for some of the helpful discussions and feedback.

Sequence logo of the natural sequences (A) and random sequences (B) predicted as sigma 70 promoters by all tools.

Note that, in the case of natural sequences, we observe the same motif pattern that we found in Fig. Fasta sequence of natural promoters. Fasta sequence of random promoters.

The promoter region is a key element required for the production of RNA in bacteria. While new high-throughput technology allows massively parallel mapping of promoter elements, we still mainly rely on bioinformatics tools to predict such elements in bacterial genomes.

Additionally, although many different prediction tools have become popular for identifying bacterial promoters, no systematic comparison of such tools has been performed. For this, we used data sets of experimentally validated promoters from Escherichia coli and a control data set composed of randomly generated sequences with similar nucleotide distributions. We compared the performance of the tools using metrics such as specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC).

Of these tools, iProFMWin exhibited the best results for most of the metrics used. We present here some potentials and limitations of available tools, and we hope that future work can build upon our effort to systematically characterize this useful class of bioinformatics tools. Also, when combining new DNA elements into synthetic sequences, predicting the potential generation of new promoter sequences is critical.

Over the last years, many bioinformatics tools have been created to allow users to predict promoter elements in a sequence or genome of interest.

Here, we assess the predictive power of some of the main prediction tools available using well-defined promoter data sets. Using Escherichia coli as a model organism, we demonstrated that while some tools are biased toward AT-rich sequences, others are very efficient in identifying real promoters with low false-negative rates. We hope the potentials and limitations presented here will help the microbiology community to choose promoter prediction tools among many available alternatives.

Thus, the correct mapping of promoters is a critical step when studying gene expression dynamics in bacteria. While the definition of promoters could vary widely, here we will consider promoters as the core elements recognized by the sigma subunit of the RNAP.

In Escherichia coli, seven sigma factors are responsible for gene expression, and sigma 70 is the most important one, as it is required for the expression of housekeeping genes (2, 3). In addition to the core promoter region, other cis-regulatory elements can play relevant roles in the regulation of gene expression (4).

In this sense, the production of RNA at the transcription start site (TSS) is the result of the interplay between the core promoter region and the cis-regulatory elements (5). Mapping of functional promoter elements has been performed mostly using low-throughput techniques such as promoter probing, primer extension, and DNA footprinting. However, the rapidly growing number of fully sequenced bacterial genomes greatly exceeds our ability to map promoter elements experimentally.

Yet, over the past years, a growing number of computational strategies have evolved in complexity. Notable novel approaches have arisen, such as sequence alignment-based kernels for support vector machines (14, 15), profiles of hidden Markov models combined with artificial neural networks (16), or weighted rules extracted from neural network models. Also, new ways to extract information from DNA sequences to perform predictions have appeared.

Thus, there are now several numerical representations of DNA sequences, each carrying its own properties (18–20), such as methods that use k-mer frequencies or variations (21, 22) and other methods that include physicochemical properties of DNA. Recently, machine learning (ML) techniques have been used to obtain insight from different sources in diverse fields of biology; an extensive survey can be seen in Libbrecht and Noble [24], Camacho et al.

Among the ML algorithms used for this purpose, we can mention support vector machines (27), neural networks (28), logistic regression (29), decision trees (30), and hidden Markov models (31). Despite the existence of all these modern techniques, promoters cannot always be inferred based on their sequence only, and it is currently unclear how efficient these tools are. This is because each new tool is validated without the use of standardized data sets or methods, making it difficult to compare novel emerging alternatives with the current state of the art.

In this work, we summarize general aspects of the available promoter prediction tools and compare their main strengths and weaknesses. For this, we compared the performance of these tools using experimentally validated promoters from E. Unexpectedly, we show that some very popular tools such as BPROM performed very poorly compared to tools created over the last 2 years. We hope our results can help users choose a suitable tool for their specific applications and help developers construct novel tools that overcome the key limitations reported here.

In this section, we present a succinct explanation of each methodology (see Table 1) as well as usability information such as use requirements and acceptable file types. Below, we briefly describe how each tool was built and some of its main features. BPROM (33) was developed as a module of an annotation pipeline for microbial sequences to find promoters in upstream regions of predicted open reading frames (ORFs).

To train the model, the authors used a data set of experimentally validated promoters from elsewhere. They applied linear discriminant analysis to discriminate between those promoters and inner regions of protein-coding sequences. For attributes, they used five position weight matrices of conserved promoter motifs, and they also considered the distance between the −10 and −35 boxes and the ratio of densities of octanucleotides overrepresented in known bacterial transcription factor binding sites (TFBS) relative to their occurrence in coding regions.

This tool is available as a web application, and users can submit a local file or paste the sequence in the web form. It quickly returns the results on the screen, with the possible −10 and −35 boxes of predicted promoters and their positions in the submitted sequence. Its positive data set consists of experimentally validated E. Its negative data set consists of genomic regions where there is no experimental evidence for the presence of TSSs. They started with 30 features distributed among these types: promoter element motifs (PWMs), the distance between the elements, oligomer scores, TFBS density, and physicochemical properties.

The final set of features was selected by evaluating the predictive power of these features by calculating Mahalanobis distance and used to train a neural network. This tool is available as a web application or as a stand-alone tool for Linux.

On the website, an e-mail address is needed to log in, and the results are saved for a week. BacPP (17) is a prediction tool to find E. For a positive data set, the authors used promoter sequences from Regulon DB for six different sigma factors in E.

Each nucleotide of these sequences was transformed into binary digits and used to train neural networks. To use this tool, the user must create a login on the website, then paste the sequences or fasta file according to their model, and select the sigma factors of interest. CNNProm (34) is a web tool that can predict prokaryotic and eukaryotic promoters from large genomic sequences or multifasta files. In the case of E. Each of these sequences was transformed into a binary four-dimensional vector and used directly as features to train a convolutional neural network.
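A binary four-dimensional encoding of this kind is commonly implemented as a one-hot vector per nucleotide. The sketch below assumes the column order A, C, G, T and an all-zero vector for ambiguous bases; neither choice is stated in the text, and the tools described may differ.

```python
# Column order A, C, G, T is an assumption made for this sketch.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Encode a DNA sequence as a list of four-dimensional binary vectors.

    Characters outside A/C/G/T (for example N) map to an all-zero vector.
    """
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in sequence.upper()]


encoded = one_hot_encode("acgtn")
```

A sequence of length L therefore becomes an L x 4 binary matrix, which can be fed directly to a neural network without hand-crafted features.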

To use this predictor, users must enter the sequences or the file on the website and choose the organism model. The image generation and selection are conducted by applying an evolutionary approach and calculating the similarity of these images in a set of E.

The authors measured the accuracy of the tool by analyzing the set of promoters and protein-coding sequences. To use this software, it is necessary to download the executable files, execute the evolutionary algorithm with the promoters of interest, and then implement the classifier software, which uses the resulting model generated in the previous step.

Virtual Footprint 36 is a web framework for prokaryotic regulon prediction. To make the prediction, it is necessary to upload a DNA sequence or a fasta file, select different PWMs for core promoter elements or other transcription factor binding sites, and set some parameters.

Its training data set consists of sigma 70 promoter sequences from the Regulon DB 9 data set. These features include, for example, different kinds of k-mer and g-gapped k-mer compositions and statistical and nucleotide frequency measures. Among the machine learning methods tested by the authors, logistic regression achieved the best results.
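To make the k-mer and g-gapped composition features concrete, here is a minimal sketch. It is an illustration of the general idea only: the gapped variant shown pairs two nucleotides separated by `gap` positions, which is one common reading of "g-gapped" and may not match the exact definition used by the tool's authors. Function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2):
    """Relative frequency of every possible k-mer over the ACGT alphabet."""
    seq = sequence.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of A/C/G/T contribute to the total.
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def gapped_pair_frequencies(sequence, gap=1):
    """Relative frequency of nucleotide pairs separated by `gap` positions."""
    seq = sequence.upper()
    all_pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = Counter(seq[i] + seq[i + gap + 1] for i in range(len(seq) - gap - 1))
    total = sum(counts[pair] for pair in all_pairs)
    return {pair: (counts[pair] / total if total else 0.0) for pair in all_pairs}


dinucleotides = kmer_frequencies("ATATAT", k=2)
gapped = gapped_pair_frequencies("ATATAT", gap=1)
```

Each sequence is thus mapped to a fixed-length numeric vector (16 values for dinucleotides), regardless of its length, which is what classifiers such as logistic regression require.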

They also applied the AdaBoost technique for feature selection to improve prediction. The attributes generated from the sequences were position-specific trinucleotide propensity and electron-ion interaction pseudopotentials of nucleotides, considering single- or double-stranded DNA, to reveal trinucleotide distribution differences between the samples and represent the interaction of trinucleotides, respectively. For model training, the authors used experimentally confirmed promoter sequences from Regulon DB 9.

It is important to emphasize that sequences with more than 0. Their feature extraction was based on multiwindow-based pseudo K-tuple nucleotide composition, which consists of a sliding window, extracting and encoding physicochemical attributes of different regions of a given sequence.

To train their model, the authors used experimentally validated promoter sequences from Regulon DB for all types of sigma factors in E. Their feature extraction was divided into two types; the first one was used to represent global features, applying biprofile Bayes and KNN (k-nearest neighbor) features, and the second one was used to represent local features, applying k-tuple nucleotide composition (a sequence-based feature) and dinucleotide-based auto-covariance (which considers physicochemical properties).

This method also performs two steps of classification: first, it resolves whether a given sequence is a promoter or not, and then it decides to which class of sigma promoter it belongs. The authors used the SVM method for classification and the F-score method for feature selection. In order to compare the performance of the promoter prediction tools presented above, we analyzed the positive and negative data sets as described in Materials and Methods.

Of the 10 algorithms selected, BacPP could not be tested with our entire data set because multifasta files were not supported, and Virtual Footprint produced a large number of predicted −10 boxes for sigma 70 in both positive and negative data sets, a number that greatly exceeds the number of sequences analyzed. Thus, these two tools were not considered in further analyses. The best performance was observed for CNNProm From the analysis presented in Fig. Analysis of the performance of promoter prediction tools.

A Percentage of sequences predicted as sigma dependent promoters in both data sets. The percentage of correct classifications of experimental promoters blue and the percentage of misclassified random sequences gray are presented.

The vertical dashed line separates the five best tools from the three worst tools analyzed. B Metrics used to evaluate the performance of the tools. It is important to emphasize that two tools presented the highest sensitivity associated with low specificity, i.

The vertical dashed line divides the four best tools from the four worst tools. Next, we performed a hierarchical clustering analysis using the results from the five tools that presented the best results. As can be seen in Fig.

In general, sequences When we analyzed the negative data set (constructed with random sequences), we did not observe a clear clustering, since each tool presented a different level of FP results, with the lowest level observed for iProFMWin Fig. In this case, only sequences It is worth mentioning that the three best tools (CNNProm, iProFMWin, and 70ProPred) are from to , indicating that, as expected, promoter prediction algorithms are evolving through the years.

Taken together, these results indicate that four out of eight tools analyzed here display equivalent predicting power to identify true promoter sequences, while the widely used tool BPROM exhibits a reduced predictive capability. Analysis of tool performance in the positive data set natural sequences.

A Hierarchical clustering of DNA sequences classified as promoters (blue) or nonpromoters (black). B Venn diagram representing the number of sequences predicted as promoters from panel A. Analysis of tool performance on the negative data set (random sequences). Hierarchical clustering of DNA sequences classified as promoters (blue) or nonpromoters (black). As presented above, we observed a high degree of similarity between the best tools for the identification of true promoters, but a lower overlap for random sequences mistakenly classified as promoters.

This could indicate that each algorithm might identify different features to assign a sequence as a promoter. To further investigate this process, we analyzed the information content from the sequences identified as promoters from the positive and negative data sets for the top five tools analyzed here.
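For readers unfamiliar with the measure, the per-position information content shown in a sequence logo can be sketched as below. This assumes the usual definition for DNA (2 bits minus the Shannon entropy of the column) without any small-sample correction; the exact settings used for the logos in this work are not stated here, and the function name is hypothetical.

```python
import math


def information_content(sequences):
    """Per-position information content (in bits) of aligned DNA sequences.

    Each position gets 2 - H, where H is the Shannon entropy of the base
    frequencies in that column. No small-sample correction is applied.
    """
    length = len(sequences[0])
    profile = []
    for pos in range(length):
        column = [seq[pos].upper() for seq in sequences]
        entropy = 0.0
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * math.log2(p)
        profile.append(2.0 - entropy)
    return profile


# Toy alignment: fully conserved, half-conserved and random columns.
profile = information_content(["TAA", "TAC", "TTG", "TTT"])
```

A fully conserved column scores 2 bits and a column with all four bases equally frequent scores 0 bits, so low values across a logo indicate little positional signal.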

The results of these analyses are presented as sequence logos in Fig. It is worth noticing that the information content was higher for iProFMWin up to 0. S1 in the supplemental material. This implies that these tools are sensitive to AT content, which makes sense since iPromoter-2L and CNNProm were trained on coding sequences as negative controls 34 , This might be explained by these tools classifying sequences that resemble true promoters, and we could not rule out the possibility that some of these random sequences could in fact display promoter activity in E.

Taken together, these results indicate that high rates of FP results observed for some of these algorithms could be due to the use of unrealistic control sequences such as coding regions that could make the algorithms sensitive to AT-rich regions, highlighting the importance of choosing appropriate nonpromoter sequences to train these tools.

Analysis of the information content of DNA sequences identified as promoters on the positive data set natural sequences. Analysis of the information content of sequences identified as promoters on the negative data set random sequences.

In this work, we performed a benchmark analysis of the performance of promoter prediction tools using well-characterized promoter sequences and random sequences. As can be seen from the results above, new tools have emerged with enhanced performance compared to widely used ones.

Although the best performing tool uses just sequence-based features a result that corroborates with Abbas et al. It is also clear from our results that choosing the appropriate control or negative data set to construct these algorithms is crucial to avoid false-positive results. Therefore, coding sequences or sequences with different features render the tools AT sensitive, increasing the false-positive rate. Furthermore, we still need an experimentally well-validated nonpromoter data set to faithfully use as negative controls in these predictions, but these sequences are not available yet.

In this sense, we expect that the growing number of high-throughput experiments could become a great source of data to create novel data sets to train new tools for promoter prediction in the future. Another complication comes from recent evidence showing that just one mutation in random sequences could lead to constitutive transcription in vivo, indicating that transcription is indeed a robust process. Additionally, several sources of prior information could be incorporated into prediction methods to improve the final tools.

For instance, the interrelation between the UP (upstream promoter) element and the α subunit of RNAP was found to play a role in transcription initiation and promoter activity (44) and to switch the preference of sigma factors in promoters. Additionally, more than proteins in E. These proteins could thus impact promoter activity in vivo, and their binding sequence preferences could influence promoter discovery.

A putative model for a bacterial promoter region, including a range of experimental attributes. These regions can have a positive (blue regions) or negative (red regions) effect on promoter activity. A notable characteristic shared by the works mentioned here is that all the available prediction tools perform only binary classifications, i.

Therefore, there is no indication of an activity threshold to classify a given sequence as a promoter, and it is known that expression levels of different bacterial transcripts vary over a wide range of orders of magnitude. However, there have been some attempts in the literature to perform regression analysis instead of binary classification only.

They also added variance in sequences that surround the core promoter and that may play a role in promoter activity. Performing a fluorescence assay to measure promoter activity and applying a partial least squares regression model, they attempt to predict promoter strength for sigma Yet, only 78 variants were characterized in this experimental design, and more variants are needed to train an accurate model. Similarly, Rhodius et al. Also, a spacer and discriminator length penalty score was added.

Notably, in vivo and in vitro expression was measured in their work, and promoter activity was also tested by a function of sigma E concentration.

Additionally, partial least squares regression was used to predict promoter activity. This approach is useful to find the elements in a given promoter sequence, and by using cross-validation, the results appear to be promising, despite the small size of the data set and the use of PWM (a model that showed poor predictive results in our work). Moreover, instead of using position weight matrices, energy matrices are being successfully built to represent the sequence-dependent binding energy using sequence libraries with a large number of variants followed by Sort-seq experiments (flow cytometry, sorting, and next-generation sequencing) (6), and these energy matrices are being employed to model promoter activity. Urtecho et al.

The authors integrated their expression cassette at different genomic locations and investigated its effects, applying a well-suited method for expression normalization. Their approach explained most of the variance in promoter activity and discovered nonlinear interactions between promoter elements by employing neural networks. Despite the limitation of a data set with discrete characteristics, this approach presents a reliable method to predict promoter strength in a well-defined context, but application of these methods to natural systems has still to be demonstrated.

One final remark is that the majority of algorithms have been created using data sets of promoters from just one bacterium, E. Consequently, since each organism has its particularities in terms of DNA binding proteins and sigma factor elements, we are still far away from having a prediction tool that can be used for several organisms.

To accomplish that, we would require extensive promoter data sets from several microorganisms to construct multipurpose prediction tools. Last, we hope the approach and metrics used here can contribute to future studies aimed to construct improved promoter prediction tools. We started this work by searching in the literature for recent and available prediction tools for E.

For each case, when a tool was available online or by software download, we selected it for subsequent analysis. Table 1 shows the summarized information about the tool methodology i. All these descriptions have been extracted from the original papers describing the tools. Next, we analyzed some usability features of the tools, such as the file format accepted as input, the maximal allowed file size, and the output format of results. Then, we selected the ones that accepted our complete data set in multifasta format as input to perform a comparative analysis.

To compare each selected tool, we used an experimentally validated promoter data set for the well-studied E. We used only sigma dependent promoters since they are mostly well-characterized in bacteria, and consequently, most tools have been developed to recognize this class of elements. Additionally, we used a negative promoter set consisting of 1, randomly generated sequences with a nucleotide distribution similar to that encountered in the natural sequences, which was constructed with an ad hoc script written in Python.
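The construction of such a random control set can be sketched as follows. This is not the authors' ad hoc script: it is a minimal reconstruction of the idea of drawing a random number between 0 and 1 and picking the nucleotide whose cumulative-frequency interval contains it. The function names, the sequence length and the seed are all hypothetical.

```python
import random
from collections import Counter


def nucleotide_intervals(sequences):
    """Build cumulative-frequency intervals from the natural sequences.

    Returns a list of (upper_bound, base) pairs covering the range 0 to 1.
    Bases that never occur are left out.
    """
    counts = Counter(b for seq in sequences for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    intervals, upper = [], 0.0
    for base in "ACGT":
        if counts[base]:
            upper += counts[base] / total
            intervals.append((upper, base))
    return intervals


def random_sequence(intervals, length, rng):
    """Draw `length` nucleotides by locating random numbers in the intervals."""
    bases = []
    for _ in range(length):
        r = rng.random()
        for upper, base in intervals:
            if r < upper:
                bases.append(base)
                break
        else:
            # Guard against floating-point rounding in the last upper bound.
            bases.append(intervals[-1][1])
    return "".join(bases)


intervals = nucleotide_intervals(["AATT"])
control = random_sequence(intervals, 81, random.Random(0))
```

Because only single-nucleotide frequencies are matched, the random sequences share the base composition of the natural promoters but carry no positional motif by construction.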

Also, it is important to stress that many tools, such as BPROM, 70ProPred, and iProFMWin, used coding and intergenic regions as control negative sequences, but this is not appropriate since coding and noncoding regions have different nucleotide compositions and structural properties 55 , In the case where the tool required the entire genome, we used the E.

The two data sets (natural and random) used here are available as Data Sets S1 and S2 in the supplemental material. Then, we obtain a random number between 0 and 1 and check which interval this value belongs to in order to pick a random nucleotide. Thus, the results were evaluated by comparing the accuracy and Matthews correlation coefficient (MCC) (57), calculated with the following equations:
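As a sketch of the standard definitions of accuracy and MCC, assuming the usual confusion-matrix counts (TP, true positives; TN, true negatives; FP, false positives; FN, false negatives), these two metrics can be written as:

```latex
\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]
```

MCC ranges from -1 to 1, with 0 corresponding to a prediction no better than chance.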

We adopted MCC because it is a metric that deals with unbalanced data sets i. It achieves high scores only if TP and TN are high, considering both types of correct classification in a single metric, and it has been shown that for this type of binary classification e. Sensitivity and specificity scores were also used to give a sense of correct classification of promoters and nonpromoters and are defined as follows:
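Assuming the same confusion-matrix counts (TP, TN, FP, FN), the standard definitions of sensitivity and specificity can be sketched as:

```latex
\[
\mathrm{Sensitivity} = \frac{TP}{TP + FN}
\qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}
\]
```

Sensitivity is therefore the fraction of experimental promoters that a tool recovers, and specificity is the fraction of random sequences that it correctly rejects.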

By testing the tools with our synthetic random data set, we can measure whether those tools are overfitting their test data sets, and by testing our positive data set (with strong experimental evidence), we are measuring underfitting, since some of our positive sequences have probably already been used to train the tools' algorithms. As some of the tools also predict promoters for other sigma factors, to be able to classify all predictions as correct or wrong, we considered random sequences classified as any sigma class promoter as FP and a sigma 70 sequence classified as any other class of sigma promoter as FN.

This does not mean that a sigma70 promoter classified as another sigma factor cannot respond to this sigma or even to sigma 70, in vivo , as we discuss later. For data representation, heatmaps were created by using the R package Heatmap.

The logos of count matrices, probability matrices, position weight matrices, and information matrices were constructed by using the Logomaker Python library. As the results generated by each tool have different formats, these were preprocessed using a text editor or ad hoc Python scripts. The data sets used are available for download as files in the supplemental material. All authors read and approved the final manuscript.


