Thesis etd-09052024-111807
Thesis type
PhD
Author
KHAN, LAREB ZAR
URN
etd-09052024-111807
Title
Optimized Machine Learning Techniques for Reliability of Optical Networks
Scientific disciplinary sector
ING-INF/03
Degree programme
Istituto di Tecnologie della Comunicazione, dell'Informazione e della Percezione - PHD IN EMERGING DIGITAL TECHNOLOGIES
Committee
Supervisor SAMBO, NICOLA
Member Prof. BOGONI, ANTONELLA
Chair Prof. ESPOSITO, FLAVIO
Member Dr. PEDRO, JOAO
Keywords
- Failure Management
- Optical Networks
- Machine Learning
- Neural Networks
- Computational Complexity
- Data Imbalance
Defence session start date
12/11/2024
Availability
Full
Abstract
The increasing reliance on data-intensive services is pushing the limits of the capacity that optical networks can deliver. Because optical fibers provide high throughput over long distances, fiber-based optical networks form the backbone of modern telecommunications infrastructure. Network operators strive to provide their customers with high-quality online experiences while adhering to service level agreements (SLAs). They aim to make their optical networks more autonomous by minimizing human intervention in network operation. In this regard, machine learning (ML) is expected to play an important role. Over the past several years, ML has been studied with the goal of achieving autonomous operation and management of optical networks. However, despite numerous studies demonstrating the potential of ML for a variety of applications, a mature commercial ML solution that is widely used in optical networks has yet to emerge. This thesis investigates some of the overlooked aspects of ML in the context of optical network failure management (ONFM), with the goal of bridging the gap between research and practical application in industry.
The thesis starts with a focus on the quality of training data, which is critical for achieving optimal performance of ML models. Within this topic, the primary focus is on reducing the imbalance in training data for classification problems (e.g., failure type classification) to prevent the ML model from being biased towards the well-represented scenario(s) in the data. A variational-autoencoder-based technique has been proposed for data augmentation, which generates synthetic samples to reduce imbalance in the training data. The proposed approach has been shown to be effective in reducing the training time of ML models, specifically neural networks (NNs), for failure detection and identification. Other demonstrated benefits include reduced computational complexity and improved classification accuracy.
Next, considering the significance of data imbalance and its relevance to ONFM, a comprehensive comparative analysis of various model-centric and data-centric techniques for addressing this issue has been conducted. The dataset used for this comparison has been obtained from an experimental testbed containing commercial devices. The effectiveness of model-centric and data-centric approaches has been compared in terms of their impact on classification performance and additional computational cost associated with each of these techniques. Failure identification has been considered as a use-case for this analysis.
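As a rough illustration of the two families being compared, the sketch below contrasts one model-centric remedy (inverse-frequency class weights) with one data-centric remedy (random oversampling). Both are generic textbook techniques chosen for illustration; the abstract does not say which techniques the thesis evaluates, and the class names and sample counts are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Model-centric: weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Data-centric: duplicate minority samples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, c in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - c):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Hypothetical toy data: 8 "normal" samples and 2 "filter_shift" failures.
labels = ["normal"] * 8 + ["filter_shift"] * 2
samples = list(range(len(labels)))

weights = inverse_frequency_weights(labels)
bal_x, bal_y = random_oversample(samples, labels)

print(weights)          # per-class loss weights, training set size unchanged
print(Counter(bal_y))   # balanced class counts, training set size increased
```

The difference in cost is visible even in this toy case: class weights leave the training set at 10 samples, whereas oversampling grows it to 16, which is one reason the additional computational cost of each approach is worth comparing alongside classification performance.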
Following this, another important yet overlooked performance aspect of ML models, computational complexity, has been studied. The thesis investigates how the improved performance achieved with ML models adds complexity to the overall network operation, and how this complexity can be reduced while maintaining the same improved performance. For this analysis, NNs, often a go-to choice in ML model selection, have been considered, with failure identification as the use-case. Two novel NN compression techniques, Iterative Neuron Removal and Guided Knowledge Distillation, have been proposed. Their effectiveness has been demonstrated as a significant reduction in computational complexity, quantified in terms of floating point operations (FLOPs) per inference and memory footprint, with no significant impact on NN classification performance.
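To make the two complexity metrics concrete, the following minimal sketch estimates FLOPs per inference and memory footprint for a fully connected NN from its layer sizes. The counting convention (one FLOP per multiply and per add, activations ignored, 4 bytes per parameter) and the layer sizes are assumptions made for illustration; the abstract does not state the convention or architectures used in the thesis, and the smaller network here is simply a narrower one, not the output of the proposed compression techniques.

```python
def dense_flops(layer_sizes):
    """FLOPs per inference for a fully connected network.

    Assumed convention: each output neuron needs n_in multiplies,
    n_in - 1 accumulations and 1 bias add, i.e. 2 * n_in FLOPs.
    Activation functions are ignored.
    """
    return sum(2 * n_in * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def dense_params(layer_sizes):
    """Number of trainable parameters (weights plus biases)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def memory_bytes(layer_sizes, bytes_per_param=4):
    """Memory footprint assuming 32-bit floating point parameters."""
    return dense_params(layer_sizes) * bytes_per_param


# Hypothetical architectures: input, hidden layers, output classes.
original = [16, 64, 64, 4]
compressed = [16, 32, 32, 4]

for name, net in (("original", original), ("compressed", compressed)):
    print(name, dense_flops(net), "FLOPs,", memory_bytes(net), "bytes")
```

Under these assumptions, halving the width of both hidden layers cuts FLOPs per inference from 10752 to 3328, because the cost of the hidden-to-hidden layer scales with the product of the two widths.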
Alarms from the Network Management System (NMS), which contain important information about network operation, have also been used for ONFM in this thesis. Several challenges associated with raw alarms have been addressed: the high cardinality of alarm data for ML, the unavailability of corresponding alarm labels, and the vendor-specificity of NMS and alarms. To deal with high cardinality, an alarm data pre-processing approach has been proposed that reduces the dimensionality of the data for ML. The proposed approach has been demonstrated to significantly reduce training and inference times for multiple ML models while having no significant impact on classification performance. Regarding the unavailability of labelled alarm data, an unsupervised ML technique for alarm classification has been proposed that does not require any labels. The obtained results suggest that up to 100% of failure alarms can be identified, albeit at the expense of low precision, depending on the threshold selection. Last but not least, failure root-cause identification has been investigated using alarms in a vendor-agnostic manner. The proposed approach makes use of alarm information, such as the impacted network element and the probable cause of the fault, which is provided by the NMS of all vendors. This approach has been tested on alarms extracted from a production network, and the results indicate that failure root-cause identification can be achieved in this manner. However, the effectiveness of the proposed approach has yet to be validated by operational staff.
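The threshold dependence mentioned for the unsupervised alarm classifier can be illustrated with a small sketch. It assumes the unsupervised model outputs a per-alarm score where higher means more likely a failure alarm; the scores and ground-truth flags below are invented for illustration and are used only to evaluate the thresholds, not to train anything.

```python
def precision_recall(scores, is_failure, threshold):
    """Flag alarms whose score is >= threshold and compare against ground truth."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, is_failure) if f and y)
    fp = sum(1 for f, y in zip(flagged, is_failure) if f and not y)
    fn = sum(1 for f, y in zip(flagged, is_failure) if not f and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scores from an unsupervised model and evaluation-only labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1, 0.4]
is_failure = [True, True, True, False, False, False, False, False]

for threshold in (0.7, 0.3):
    p, r = precision_recall(scores, is_failure, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With these toy values, the stricter threshold flags only true failure alarms but misses one of them, while the looser threshold catches all three failure alarms at the cost of three false positives, mirroring the recall and precision trade-off described above.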
To summarize, a few issues regarding ML-assisted fault management have been addressed, and several solutions to make ML techniques more effective in dealing with optical network failures have been successfully proposed in the scope of this thesis.
File
File name | Size |
---|---|
PhD_Thes...Lareb.pdf | 21.45 Mb |