Effective Big Data Caching through Reinforcement Learning

Tracolli M.; Baioletti M.; Poggioni V.
2020

Abstract

In the era of big data, data volumes continue to grow across many domains, from business to science. Sensors, edge devices, scientific applications and detectors generate huge amounts of data that are distributed by nature. Extracting value from such data requires a typical pipeline with two main steps: first the processing, then the data access. A key requirement for data access is fast response time, whose order of magnitude can vary widely depending on the type of processing and on the processing patterns. Optimizing the access layer becomes increasingly important in a geographically distributed environment, where data must be retrieved from the remote servers of a data lake. From the infrastructure perspective, caching systems are used to mitigate latency and to serve popular data more efficiently. The cache therefore plays a key role in effective and efficient data access. In this article, we propose a Reinforcement Learning approach, based on the Q-Learning technique, to improve the performance of a cache system in terms of data management. The proposed method uses two agents with different objectives and actions to control the addition and the eviction of files in the cache. The aim of the system is to increase throughput while reducing both cache costs, such as the amount of data written, and network utilization. We tested our method in a data analysis context, using information taken from High Energy Physics (HEP) workflows.
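
To make the two-agent idea concrete, the following is a minimal sketch of tabular Q-Learning with one agent for addition and one for eviction. It is not the authors' implementation: the abstract does not specify states, actions, or rewards, so the class `QAgent`, the action names (`store`, `skip`, `keep`, `evict`), the state label `small_popular`, and all hyperparameter values are hypothetical and chosen only for illustration. Only the update rule itself is the standard Q-Learning rule.

```python
import random
from collections import defaultdict


class QAgent:
    """Tabular Q-learning agent with epsilon-greedy action selection (illustrative)."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate (assumed value)
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.rng = random.Random(seed)

    def choose(self, state):
        # Explore with probability epsilon, otherwise pick the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning rule:
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Two agents with separate action sets, mirroring the addition/eviction split.
# The action names are hypothetical, not taken from the paper.
addition_agent = QAgent(actions=["store", "skip"])
eviction_agent = QAgent(actions=["keep", "evict"])

# Hypothetical transition: a small, frequently requested file was stored and
# later produced a cache hit, so the addition agent receives a positive reward.
addition_agent.update(
    state="small_popular", action="store", reward=1.0, next_state="small_popular"
)
```

Keeping two separate Q-tables lets each agent learn from its own reward signal, for example a throughput-related reward for addition and a cost-related one for eviction; how the paper actually defines those signals is not stated in the abstract.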
ISBN: 978-1-7281-8470-8


Use this identifier to cite or link to this document: https://hdl.handle.net/11391/1495765