
UNESP 2015 - 2017
Project Overview
Metagenomic analysis has underscored the critical role of microbial communities in diverse environments. It enables the sequencing and analysis of genetic material from entire microbial communities without the need to culture microorganisms. It is therefore crucial for revealing their diversity and function, providing a comprehensive understanding of the whole ecosystem.
A metagenome is the dataset resulting from DNA sequencing of an entire microbial community. It consists of DNA fragments of varying sizes, depending on the sequencing technology used, originating from the diverse microbes present in the sample.
Ideally, these fragments would be assembled to recover the entire genomes of the different microbial species. However, even identifying the origin of sequences is a highly challenging task. Consequently, bioinformatics developers have been making significant efforts to create better tools to optimize this process.
Over the past decade, machine learning (ML) approaches have been applied across various fields, including metagenomics. The capability of ML algorithms to learn complex patterns, generalize information, automate processes, and handle high-dimensional data offers numerous advantages over traditional methods.
Nevertheless, challenges such as the variability and sparsity of microbiomes require special attention when using ML approaches. Appropriate normalization techniques and careful feature selection are therefore essential for successful analysis.
Between 2015 and 2017, when this project was undertaken, machine learning was beginning to be applied to metagenomic analyses, addressing the challenges posed by the sequencing technologies of that time.
Today, both tools and sequencing technologies have evolved. Computational power and storage capacity have increased, and sequencing methods now deliver analyses with fewer errors and longer sequences, significantly enhancing the robustness of analyses.
The primary goal of this project was to define features for classification in a machine-learning model, as a first step toward developing a classification tool. A commonly used set of features for this purpose comes from compositional analysis, which uses the intrinsic pattern of a sequence's nucleotide composition to identify its origin. These characteristics are shaped within each taxonomic group over the course of evolution under evolutionary pressures; consequently, phylogenetically closer species exhibit more similar nucleotide compositions.
In this analysis, we evaluated traditional parameters such as GC content and oligonucleotide frequency (k-mers), as well as less common parameters such as dinucleotide abundance, k-entropy, and TETRA (tetranucleotide frequency with bias discrimination) in machine learning to identify which parameter performs best in classifying sequences at the phylum level.
For developing the model, the machine learning algorithm support vector machine (SVM) was utilized.
Identifying effective features for machine learning classification of metagenomic reads is crucial for developing robust tools that enable researchers to analyze microbial communities comprehensively. This step is pivotal in advancing our understanding of these dynamic ecosystems and fostering the development of knowledge and technologies across diverse fields such as agriculture, health, and ecology. The following sections will provide detailed insights into the methodology and results of this study.
References: Achudhan et al., 2024; Papoutsoglou et al., 2023; Mathieu et al., 2022; Teeling et al., 2004

Workflow
The SVM algorithm and pre-processing code were developed in Mathematica 8.0, using the Wolfram Language. In the coming weeks, the source code will also be integrated into the project.
Given that the aim of this analysis is to explore the performance of features, binary simulations were employed instead of multiclass simulations.

Figure 1: Flowchart of the entire workflow process.
1. DATA PRE-PROCESSING
The initial phase of the project involves data preparation for input into the SVM algorithm. In this analysis, our objective is to investigate the performance of various features concerning sequence composition. Consequently, a simulated metagenome was generated to facilitate controlled analysis.
1.1. Data Acquisition and Taxonomic Organization
To create the simulated metagenome, we initially sourced microorganism complete genomes from the NCBI GenBank database (May 2015), encompassing taxa from Archaea, Bacteria, and Fungi.
Bacterial genomes were categorized into phyla based on the NCBI taxonomic classification (May 2015). The Proteobacteria phylum, known for its extensive species diversity, was further subdivided into the classes Alphaproteobacteria, Betaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, and Gammaproteobacteria.
Because few phyla were represented, Fungi and Archaea sequences were classified only at the "Fungi" and "Archaea" levels. Phyla with fewer than two available species were excluded.
The taxonomic classification is shown in the table below.
Table 1: Taxonomic divisions used in the analysis. Taxa effectively used are highlighted in dark grey cells.

1.2. Genome Fragmentation
The genomes obtained were randomly fragmented into sizes of 100, 400 and 1000 base pairs, ensuring that there was no overlap between them.
These sizes were chosen to represent the second-generation sequencing technologies most commonly used at the time of the project (2015 to 2017): Illumina (reads of approximately 100 to 150 bp), Ion Torrent (reads around 400 bp), and Roche (reads up to 700 bp).
These sequence lengths pose an additional challenge when analyzing metagenomic data generated by second-generation sequencing: the shorter the read, the harder it is to achieve precise identification. It is therefore common practice to include a pre-assembly step to lengthen the sequences, which increases the time and complexity of the analysis. Hence, the objective of testing features with shorter read sizes is to determine which parameters are most effective for metagenomic identification without requiring pre-assembly.
References: Morozova and Marra, 2008; Thomas et al., 2012; Lee et al., 2016
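As an illustration, a minimal Wolfram Language sketch of this fragmentation step is shown below (the function name and the grid-based sampling of start positions are our own assumptions; the original scripts may have drawn positions differently):

    (* Draw n non-overlapping fragments of a fixed size from a genome string. *)
    (* Candidate start positions lie on a grid spaced by the fragment size,   *)
    (* so sampled fragments can never overlap.                                *)
    randomFragments[genome_String, size_Integer, n_Integer] :=
      Module[{starts},
        starts = RandomSample[Range[1, StringLength[genome] - size + 1, size], n];
        StringTake[genome, {#, # + size - 1}] & /@ starts]

    (* Example: 500 fragments of 400 bp from a loaded genome string *)
    (* randomFragments[genome, 400, 500] *)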
1.3. Feature Extraction
Features are specific measurable characteristics extracted from data and used for classification. One of the most common types of features for taxonomic classification of sequences is genomic signatures. These reflect the compositional patterns of sequences in terms of nucleotide frequencies and positions along the genome. Such patterns are shaped by evolutionary pressures, so their similarity among species correlates with taxonomic proximity. Providing a set of taxonomically labeled features as training data allows the machine learning algorithm to identify patterns for accurate sequence classification. These features were extracted from the fragmented sequences prepared in the previous preprocessing step.
Here, we compared nine parameters for read classification. Each is detailed below.
1.3.1. GC Content
The GC content, the percentage of guanine and cytosine in a sequence, is a traditional genomic signature used for various types of sequence characterization. These include identifying species, understanding evolutionary relationships, and predicting gene expression levels.
The evolutionary significance of GC content stems from the high stability provided by the three hydrogen bonds in G-C pairing, in contrast to the two bonds in the A-T pair. This stability is an advantage for organisms inhabiting extreme environments, but can be a drawback for fast replication, as in viruses. Moreover, some pathogenic viruses, such as H1N1, avoid immune detection by adopting a GC content similar to that of their hosts.
Consequently, these particular adaptations shape species-specific GC patterns, making GC content a suitable parameter for taxonomic identification.
References: Campbell et al., 1999; Gerhardt et al., 2013; Greenbaum et al., 2008; Passel et al., 2006
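In the Wolfram Language used for this project, GC content reduces to a one-line function; a minimal sketch (the function name is ours):

    (* Fraction of G and C bases in a sequence *)
    gcContent[seq_String] := N[StringCount[seq, "G" | "C"]/StringLength[seq]]

    gcContent["ATGCGCAT"]  (* 0.5 *)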
1.3.2. K-mers
The k-mers feature measures the frequency of each oligonucleotide in a DNA or RNA sequence. Oligonucleotides are sequences of length k, with 4^k possible combinations for a given k. For instance, 2-mers are oligonucleotides of length 2, giving 16 possible combinations: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT.
These parameters are widely used in alignment-free taxonomic classification algorithms, with different k values depending on the dataset, because they vary significantly among organisms. Like GC content, k-mers are subject to evolutionary pressure through factors such as DNA stability, methylation, and repair and replication mechanisms. Additionally, their consistent values throughout the genome make them suitable for identifying metagenomic reads.
To extract the k-mer feature from a sequence submitted to classification, the frequency of each oligonucleotide is calculated. The k-mer feature is therefore not a single value but a vector of length 4^k. In our project, we used 2-, 3-, and 4-mers, resulting in 16, 64, and 256 combinations, respectively. We selected short k-mers to be compatible with the length of the non-assembled reads we intend to classify.
References: Karlin et al., 1997; Takahashi et al., 2009
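A sketch of the k-mer extraction in Wolfram Language (names are ours; overlapping occurrences are counted, consistent with a sliding window over the read):

    (* Frequency vector of all 4^k overlapping k-mers in a sequence *)
    kmerFrequencies[seq_String, k_Integer] :=
      Module[{words, counts},
        words = StringJoin /@ Tuples[{"A", "C", "G", "T"}, k];
        counts = StringCount[seq, #, Overlaps -> True] & /@ words;
        N[counts/Total[counts]]]

    Length[kmerFrequencies["ACGTACGT", 2]]  (* 16 *)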
1.3.3. Dinucleotide Abundance
As with the previous features, the frequency of dinucleotides also significantly influences the structure of the DNA molecule. This can affect how DNA interacts with proteins, especially those involved in replication and transcription. These mechanisms are under strong selective pressure, which produces patterns that differ among species and among organisms adapted to diverse environments, enhancing their effectiveness as a parameter for identifying sequences.
The most commonly used measure for classifying sequences is the relative abundance of dinucleotides, given by:

$$\rho_{XY} = \frac{f_{XY}}{f_X \, f_Y}$$

where $f_{XY}$ represents the frequency of the dinucleotide XY in the sequence, and $f_X$ and $f_Y$ the frequencies of nucleotides X and Y.

This equation compares the observed frequency of the dinucleotide XY with its expected frequency in a random sequence containing the same proportions of nucleotides X and Y; deviations from 1 indicate over- or under-representation.
Here, we condensed this into a single feature value by averaging the relative abundances of the 16 dinucleotides:

$$AB = \frac{1}{16} \sum_{X,Y \in \{A,C,G,T\}} \rho_{XY}$$
References: Dutta and Paul, 2012; Karlin and Burge, 1995; Karlin, 1998
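A possible Wolfram Language implementation of this averaged feature, sketched directly from the equations above (guards for absent nucleotides are omitted for brevity):

    (* Mean relative abundance over all 16 dinucleotides:           *)
    (* rho[XY] = f(XY)/(f(X) f(Y)), averaged into one feature value *)
    dinucleotideAbundance[seq_String] :=
      Module[{bases, fMono, fDi, rho},
        bases = {"A", "C", "G", "T"};
        fMono = N[StringCount[seq, #]/StringLength[seq]] & /@ bases;
        fDi = N[StringCount[seq, #, Overlaps -> True]/(StringLength[seq] - 1)] & /@
          (StringJoin /@ Tuples[bases, 2]);
        rho = Flatten@Table[fDi[[4 (i - 1) + j]]/(fMono[[i]] fMono[[j]]), {i, 4}, {j, 4}];
        Mean[rho]]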


1.3.4. K-entropy
The concept of entropy quantifies the disorder or uncertainty within a system. In our study, entropy reflects the level of disorder in the distribution of oligonucleotides within a sequence. We used entropy measures for dinucleotides (S2), trinucleotides (S3), and tetranucleotides (S4).
To calculate entropy, it is necessary to first estimate the probability of the oligonucleotide combinations in the sequence (Pn). This value is obtained by counting the number of times each oligonucleotide appears in a sliding window within the sequence and dividing this number by the total number of nucleotides in the segment.
Entropy is then calculated with Shannon's equation:

$$S_k = -\sum_{n=1}^{N} P_n \log P_n$$

where $P_n$ denotes the probability of the $n$-th oligonucleotide in a window of length $W$, and $N$ represents the number of possible oligonucleotides: 16 for dinucleotides, 64 for trinucleotides, and 256 for tetranucleotides.
References: Gerhardt et al., 2011
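Reusing the kmerFrequencies helper sketched in section 1.3.2, the entropy feature can be computed as follows (a sketch; we assume the natural logarithm, since the base only rescales the feature):

    (* Shannon entropy of the k-mer distribution: S2, S3, S4 for k = 2, 3, 4 *)
    kEntropy[seq_String, k_Integer] :=
      Module[{p},
        p = Select[kmerFrequencies[seq, k], # > 0 &]; (* zero terms drop out *)
        -Total[p Log[p]]]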

1.3.5. TETRA
As discussed earlier, species-specific patterns can be found in the DNA word frequencies of microorganisms. Among the various oligonucleotide sizes from which frequencies can be calculated, tetranucleotides (size 4) are well suited to identifying genome patterns, balancing resolution with computational efficiency. However, the raw tetranucleotide frequency, referred to earlier as 4-mers, is influenced by biases originating from the frequencies of smaller oligonucleotides (mono-, di-, and trinucleotides). The TETRA method was developed to correct for these biases.
To calculate TETRA, the observed counts of all 256 possible tetranucleotides are first computed along with their corresponding expected counts, and the deviations are converted into z-scores. Denoting $N(n_1n_2n_3n_4)$ as the observed count of the tetranucleotide $n_1n_2n_3n_4$, the z-score is:

$$Z(n_1n_2n_3n_4) = \frac{N(n_1n_2n_3n_4) - E(n_1n_2n_3n_4)}{\sqrt{\operatorname{var}(N(n_1n_2n_3n_4))}}$$

considering that the expected count and its variance follow from a maximal-order Markov model:

$$E(n_1n_2n_3n_4) = \frac{N(n_1n_2n_3)\,N(n_2n_3n_4)}{N(n_2n_3)}$$

$$\operatorname{var}(N(n_1n_2n_3n_4)) = E(n_1n_2n_3n_4)\,\frac{\left[N(n_2n_3)-N(n_1n_2n_3)\right]\left[N(n_2n_3)-N(n_2n_3n_4)\right]}{N(n_2n_3)^2}$$
References: Teeling et al., 2004
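A Wolfram Language sketch of the TETRA z-scores following the equations above (function names are ours; zero-count guards for rare words in short reads are omitted):

    (* z-score vector over all 256 tetranucleotides, per Teeling et al. (2004) *)
    tetraZScores[seq_String] :=
      Module[{count, z},
        count[w_String] := StringCount[seq, w, Overlaps -> True];
        z[w_String] := Module[{n123, n234, n23, n4, e, v},
          n123 = count[StringTake[w, 3]];      (* first trinucleotide  *)
          n234 = count[StringTake[w, -3]];     (* last trinucleotide   *)
          n23 = count[StringTake[w, {2, 3}]];  (* middle dinucleotide  *)
          n4 = count[w];
          e = n123 n234/n23;                       (* expected count       *)
          v = e (n23 - n123) (n23 - n234)/n23^2;   (* approximate variance *)
          (n4 - e)/Sqrt[v]];
        z /@ (StringJoin /@ Tuples[{"A", "C", "G", "T"}, 4])]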


2. MODEL TRAINING AND TEST
In this phase, we proceed with training and testing the SVM model using the prepared data. We focus on evaluating feature performance in sequence taxonomic classification. For this phase, we have separated our dataset into distinct groups to systematically assess the model's ability to classify sequences accurately.
2.1. Support Vector Machines
For developing the model, a support vector machine (SVM) was employed. SVMs are supervised machine learning algorithms widely used for their ability to generalize well and handle large datasets efficiently. They are commonly applied in text categorization, image analysis, and bioinformatics.
SVMs classify data by mapping the labeled training data into a feature space and separating the classes with a hyperplane. The hyperplane is optimized by maximizing the margin between classes, using a suitable loss function to determine the best position for class discrimination. When unknown data are introduced, they are positioned in the same space and classified according to their location relative to the hyperplane.
This approach was chosen due to its capability to manage high-dimensional and non-linear data, consider interactions among data inputs, and its efficiency in execution.
References: Kunin et al., 2008; Liu et al., 2013; Kapetanovic et al., 2004; Lorena and Carvalho, 2007

Figure 2:
SVM separates classes using a hyperplane in the most efficient manner possible. The figure illustrates three example lines dividing the classes in the plane. Line H1 does not separate the classes and is unsuitable for prediction. Line H2 separates the classes, but with a narrow margin, making it inefficient. Line H3 is the most efficient, as it maximizes the margin between the classes, making it optimal for classification.
Source: “File:Svm separating hyperplanes.png,” Wikimedia Commons, the free media repository, 19 September 2020 09:26 UTC
2.2. Simulation Groups
To assess feature performance under different data conditions, we divided our dataset into three distinct groups for algorithm training and testing. This distribution is illustrated in the figure at the beginning of the Workflow section.
- Group 1: Positive training data consisted of Gammaproteobacteria sequences, with Escherichia coli sequences used as positive test samples. Negative training and test sets comprised sequences randomly selected from microorganisms outside the Gammaproteobacteria class.
- Group 2: Similar to Group 1, positive training data were Gammaproteobacteria sequences, and the positive test set included Gammaproteobacteria sequences different from those used in training. Negative training and test sets were selected from non-Gammaproteobacteria microorganisms.
- Group 3: Positive training data comprised Gammaproteobacteria sequences excluding E. coli, while E. coli sequences were used as positive test samples. Negative training and test sets were again chosen randomly from microorganisms outside the Gammaproteobacteria class.
Groups 1 and 2 aimed to evaluate classification accuracy for known species present in the dataset. Group 3 simulated the presence of an uncataloged microorganism, absent from the training data, to assess feature accuracy in correctly classifying it.
Each group utilized 500 sequences for positive and negative training and testing, totalling 2000 sequences per group.
2.3. Kernel Decision
In SVM, we can simplify the concept by imagining the training data points plotted on a plane according to their feature values. The SVM then draws a line to separate these points into classes, allowing new data points to be classified by their position relative to this line.
When a straight line can effectively separate the data, this is known as linear classification. However, when a more effective separation between classes requires a more flexible boundary, different equations known as kernels can be applied; Figure 3 illustrates this concept.
There are various types of kernels such as linear, Radial Basis Function (RBF), polynomial, and sigmoid, each suited to different types of data. To determine the optimal kernel for our data, we conducted tests using Group 1 (as mentioned previously) for algorithm training and testing.
References: Cristianini and Shawe-Taylor, 2000
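For reference, the RBF (Gaussian) kernel tested here maps two feature vectors to a similarity score; a minimal sketch, with gamma as the width hyperparameter:

    (* Radial Basis Function kernel between two feature vectors *)
    rbfKernel[x_List, y_List, gamma_] := Exp[-gamma Total[(x - y)^2]]

    rbfKernel[{0., 1.}, {1., 0.}, 0.5]  (* Exp[-1.] = 0.367879 *)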

Figure 3:
The figure illustrates the flexibility of an SVM kernel, showcasing its ability to adapt and create non-linear decision boundaries to better fit the data.
Source: “Kernel Machine.” Wikimedia Commons, 12 September 2020. Retrieved 17 June 2024
2.4. Feature Evaluation
To assess the impact of each feature on the prediction, iterations were conducted omitting one feature at a time. A simulation including all features was also performed as a control. For k-mers, which consist of a vector with values for all possible nucleotide combinations, the entire vector was omitted.
With nine feature-exclusion simulations plus one including all features, we planned 10 simulations. These were conducted for each of the three groups and three sequence sizes, resulting in 90 conditions, each repeated over 120 simulations.
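Schematically, the leave-one-feature-out loop looks like the sketch below; trainAndEvaluate is a hypothetical placeholder for the full SVM train/test cycle, and the feature abbreviations follow Figure 5:

    (* One ablation pass: drop each feature (or whole k-mer block) in turn *)
    featureNames = {"GC", "AB", "E2", "E3", "E4", "K2", "K3", "K4", "Te"};
    ablation = Table[
      {featureNames[[i]], trainAndEvaluate[Delete[featureNames, i]]},
      {i, Length[featureNames]}];
    control = trainAndEvaluate[featureNames];  (* all features included *)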
2.5. AUC for Classifier Efficacy
The AUC (area under the ROC curve) serves as a criterion to measure the classifier's efficiency. The ROC (Receiver Operating Characteristic) curve is a graph in which the Y-axis represents the true positive rate and the X-axis the false positive rate. The values obtained in the iterations are plotted on this chart, forming the ROC curve.
This statistical measure is widely used to evaluate the machine learning classifier’s accuracy because it balances the trade-off between benefits and costs associated with each classifier. The AUC value quantifies the ROC curve, facilitating comparison among different classifiers.
References: Fawcett, 2006; Hajian-Tilaki, 2013
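AUC can also be computed directly from classifier scores via its rank interpretation: the probability that a randomly chosen positive scores above a randomly chosen negative. A sketch, with the built-in MannWhitneyTest used later for comparing AUC samples:

    (* AUC from positive and negative test scores; ties count as 1/2 *)
    auc[pos_List, neg_List] :=
      N[Total[Boole[#[[1]] > #[[2]]] + 0.5 Boole[#[[1]] == #[[2]]] & /@
           Tuples[{pos, neg}]]/(Length[pos] Length[neg])]

    (* Comparing two samples of AUC values, e.g. from two kernels: *)
    (* MannWhitneyTest[{aucRBF, aucLinear}] returns the p-value    *)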
RESULTS
1. KERNEL DECISION
To evaluate the performance of different kernels in the SVM model, a series of tests was conducted using the Group 1 dataset. The analysis compared the classification AUC (area under the curve) for each kernel type. The results are summarized in the table and graph below.

Figure 4:
AUC distribution for Radial Basis Function (RBF), linear, polynomial, and sigmoid kernels shows that the RBF values are higher than the others, indicating its higher suitability compared to the other kernels.
According to the graph data, we observed that the RBF kernel achieved a higher AUC value compared to the Linear, Polynomial, and Sigmoid kernels. This suggests that SVM iterations using the RBF kernel correctly classify more sequences than simulations with the other kernels.
To assess the significance of these differences, a Mann-Whitney test was performed; the results are presented in the table below. This test was selected due to the independent and non-parametric nature of the data. Differences were considered significant when the p-value was less than 0.05.
Table 2: The comparison of AUC values among different kernels was performed using the Mann-Whitney test to evaluate if the prediction results differ significantly. Differences were considered significant when the p-value was less than 0.05. Grey cells indicate significant differences, while white cells indicate non-significant differences.

Based on the statistical analysis, the RBF kernel demonstrated better performance, with significantly higher AUC values than the other kernels, and the Mann-Whitney test confirmed these differences as statistically significant. The RBF kernel was therefore selected as the most suitable choice for our SVM model and was used for all subsequent simulations.
2. FEATURE EVALUATION
To evaluate how each feature influences prediction accuracy, we executed simulations excluding one feature at a time for all sequence sizes. Then, we calculated the AUC values of the results and plotted them in Figure 5.

Figure 5: AUC distribution for the three groups and three sequence sizes, for simulations that excluded one feature at a time. +All denotes simulations with all features. GC = GC content; AB = dinucleotide abundance; E2 = dinucleotide entropy; E3 = trinucleotide entropy; E4 = tetranucleotide entropy; K2 = 2-mers; K3 = 3-mers; K4 = 4-mers; Te = TETRA. Graphs plotted with Mathematica 8.0.
According to the graphs in Figure 5, Group 1 exhibited overfitting, demonstrated by results showing nearly 100% correct predictions. This is likely because Escherichia coli, with a substantial amount of available genome data, represents a significant portion of the Gammaproteobacteria genomes. Therefore, when the algorithm selects random genomes for the positive Gammaproteobacteria training set, there is a high chance of including many E. coli sequences, leading to overfitting. Consequently, we consider only the results of Groups 2 and 3.
The results from iterations using Groups 2 and 3 are more appropriate for drawing conclusions about the influence of features on sequence classification. In these groups, we observed a decrease in classification success when 4-mers, and particularly TETRA, were excluded from the feature pool. This indicates that these features are optimal for classifying sequences of all tested sizes. The exclusion of TETRA had the most significant impact. In the 100 bp simulation, excluding TETRA reduced prediction accuracy from a median AUC of approximately 0.85 and 0.77 to 0.54 and 0.48, representing a decrease of 34% and 23% in accuracy for Groups 2 and 3, respectively. Similar trends were observed for sequence sizes of 400 bp and 1000 bp, where the exclusion of TETRA showed a less pronounced but still noticeable decrease in accuracy.
In contrast, excluding Dinucleotide Abundance and Diplet Entropy resulted in increased classification success, mainly for 400 bp sequences. Dinucleotide Abundance exclusion raised the median AUC from approximately 0.85 to 0.88 for Group 2 and from approximately 0.79 to 0.83 for Group 3.
Table 3: Significance of results using the Mann-Whitney test for 400 bp sequences. The test compared simulations excluding specific features from the feature set against simulations including all features in the SVM iterations. ✘ indicates a significant increase in prediction accuracy due to feature exclusion, while ✔ indicates a significant decrease. Therefore, ✘ suggests unfavorable features, whereas ✔ suggests favorable ones.

To evaluate significant differences, we performed the Mann-Whitney test for 400 bp on each excluded feature simulation and compared it against the simulation that included all features. AUC differences with P-values lower than 0.05 were considered significant.
This test confirmed the negative influence of Dinucleotide Abundance and Diplet Entropy for Groups 2 and 3, while highlighting the positive influence of 4-mers and TETRA for both groups, as well as GC content for Group 2 and 3-mers for Group 3. These features are essential for accurate prediction under these conditions, as shown in Table 3.
DISCUSSION
Binning classifiers for metagenomes are constantly evolving to balance accuracy and computational efficiency. Alignment tools, which require more computational resources, offer precise results at the genus and species levels. On the other hand, compositional classifiers frequently use machine learning to taxonomically classify reads based on sequence patterns; they offer faster processing but are typically limited to higher taxonomic levels. Both approaches face the challenge of classifying unknown microorganisms lacking sequence data, leading to potential false negatives.
In our study using Group 3, we addressed this challenge by excluding E. coli from training, simulating its unavailability in the database. Notably, this exclusion did not significantly impair the accuracy of evaluated features. Our results demonstrate that uncataloged microorganisms can still be classified at higher taxonomic levels, especially when Tetra is included while dinucleotide abundance is excluded.
However, Group 1 exhibited overfitting due to the high abundance of complete genomes from model organisms like E. coli within Gammaproteobacteria. Randomly selecting sequences primarily from E. coli genomes reduces the model's ability to generalize Gammaproteobacteria patterns, resulting in overfitting. To mitigate this, future phases of our project will refine training data selection to ensure sequence diversity and incorporate non-model species for comparative analysis across diverse microorganisms. Additionally, this project's initial phase aims to evaluate features for subsequent tool development, emphasizing the importance of a multiclass approach to align results with real metagenomic applications.
Additionally, using real metagenomes would be essential to confirm these results, given the various challenges involved, such as data sparsity, difficulty in achieving high coverage, variations in taxonomic abundance among samples, complex interdependent structures, and other intrinsic technical challenges. These factors collectively shape the nature of the analysis and could influence the relevance of the features identified.
Regarding feature analysis, TETRA and 4-mers demonstrated a significant positive influence, whereas dinucleotide abundance and entropy harmed sequence classification. Shorter k-mers impair classification not only because of their low resolution and information content, but also because of k-mer homoplasy, where identical k-mers arise independently in different organisms by chance rather than through shared ancestry. Shorter k-mers are more susceptible to homoplasy because of their higher frequency along genomes.
In contrast, longer k-mers such as TETRA and 4-mers positively impacted classification by carrying more taxonomic information and reducing homoplasy. Optimal k-mer lengths in alignment-free sequence classification tools typically range from 8 to 14 base pairs, balancing information content against sequence length. However, longer k-mers increase data dimensionality and may lead to overfitting given the short length of the sequences we are classifying, necessitating careful selection of k-mer length.
Interestingly, the 256-value 4-mer vector was appropriate as a feature, given its significant impact on classification success. A dedicated analysis to discern the individual influence of each oligonucleotide would be beneficial, enabling us to further explore the feature's potential and reduce dimensionality. Furthermore, it is noteworthy that TETRA's relevance, represented by a single value, was higher than that of 4-mers, which consists of a vector of 256 values and thus provides more information. This highlights the importance of the normalization applied during the TETRA calculation for its success as a parameter, demonstrating that the number of features was not the decisive factor in the success of 4-mers.
As mentioned earlier, this project was conducted between 2015 and 2017, when the popular sequencing technologies produced shorter reads, posing challenges for classification without pre-assembly. Today, third- and fourth-generation sequencing technologies such as PacBio and Oxford Nanopore have gained popularity, producing reads ranging from 10 thousand to over 100 thousand base pairs. It would therefore be beneficial to include longer sequences for classification, enabling the use of longer k-mers and higher-order oligonucleotide entropies as features.
References: Bussi et al., 2024; Zielezinski et al., 2017; Papoutsoglou et al., 2023
Conclusion
Finally, the feature-exclusion methodology proved suitable for evaluating the impact of individual features on the taxonomic classification of sequences. By systematically excluding features such as Dinucleotide Abundance and Diplet Entropy, we could assess their influence on classification accuracy across various sequence lengths. This approach not only highlighted the critical role of longer k-mers such as TETRA and 4-mers but also underscored the challenge of dealing with database biases and unknown microorganisms. Moving forward, refining these methodologies could enhance the robustness and applicability of sequence classification in real-world metagenomic analyses.