DTA

Digital Archive of Electronic Theses and Final Dissertations

 

Thesis etd-10232024-213804

Thesis type
Ordinary Course, Second Level
Author
MANGIACAPRE, MARCO LUCIO
URN
etd-10232024-213804
Title
Software-based fault-tolerance with OpenMP
Academic unit
Class of Experimental Sciences
Degree programme
ENGINEERING - ENGINEERING
Committee
Tutor Prof. CASTOLDI, PIERO
Supervisor ROYUELA ALCAZAR, SARA
Supervisor Prof. CUCINOTTA, TOMMASO
Chair Prof. BOGONI, ANTONELLA
Member Dr. CREA, SIMONA
Member Prof. ABENI, LUCA
Member Prof. ANDREUSSI, TOMMASO
Member Prof. AVIZZANO, CARLO ALBERTO
Member Prof. MICERA, SILVESTRO
Member Prof. ODDO, CALOGERO MARIA
Member Prof. RICOTTI, LEONARDO
Keywords
  • checkpointing
  • error recovery
  • OMP
  • OMP tasking
  • OpenMP
  • radiation resilience
  • task
  • Xilinx
  • Xilinx zcu102
Defense session start date
09/12/2024
Availability
partial
Abstract
Parallel programming has steadily grown in importance over the last decades, as the number of cores available even in consumer CPUs has increased significantly. In recent years, this growth in compute power has also reached devices that were previously characterized by extremely limited capabilities: multi-core embedded devices with multiple accelerators are available today. The combination of multi-core devices and embedded systems has introduced new reliability concerns in software development, since parallel programming is notorious for the variety and complexity of its possible failures and bugs. In such an error-prone environment, programming models like OpenMP are an excellent way to avoid writing thread-management code by hand, and thus to avoid many common bugs. However, simplifying development is not sufficient in extreme environments, where radiation-induced bit errors can occur probabilistically at any point and compromise the results of otherwise correct algorithms. These kinds of issues have normally been handled with special hardware capable of detecting anomalies and redoing operations automatically, but such hardware is extremely expensive, which prevents its use in many contexts. In this research work, a new approach has been developed that tries to tackle these problems, at least partially, in software, aiming to allow the use of parallel consumer-grade hardware even in space by relying on a verify-or-repeat strategy.
File