AutoMon. Automatic monitoring and problem detection for distributed systems

Wikstad, Magnus

dc.contributor.advisor	Bjørndalen, John Markus
dc.contributor.author	Wikstad, Magnus
dc.date.accessioned	2016-07-01T10:28:45Z
dc.date.available	2016-07-01T10:28:45Z
dc.date.issued	2016-05-15
dc.description.abstract	When working with distributed systems, detecting faults can be a difficult task, as abnormalities are not necessarily made evident by warnings or system crashes. This is especially true of subtle faults, such as variations in the performance of a running program. The program itself is not necessarily at fault; the cause may be a different source somewhere in the cluster that uses a lot of resources (CPU, IO, etc.), thereby causing other programs to perform worse than in earlier executions. These types of problems will not necessarily be detected by regular cluster monitoring tools, as these only look at cluster metrics, or by distributed debuggers, as these only monitor specific programs and thus will not find the cause of the degraded performance if it comes from a different source. As distributed systems are increasingly used by people without intimate knowledge of them, being able to quickly inform the user about any faults or abnormalities would greatly improve how efficiently they use the system. It would additionally be a great help to developers, as they could easily get their programs' performance data without implementing specific procedures for the task, thus simplifying the development of new distributed software. This thesis investigates whether the system and process information obtainable from each node's operating system is enough to detect abnormal operation. This is approached by creating a prototype system that collects this information from the cluster and analyzes the data at runtime to check for faults. The resulting system is capable of collecting large amounts of data from the cluster, storing it, and doing some rudimentary analysis on the data, while leaving most of the cluster's resources free for its computations.
This shows that it is possible to create a low-resource cluster monitoring tool that collects large amounts of system data at high frequency from each of the nodes and analyzes the data.	en_US
dc.identifier.uri	https://hdl.handle.net/10037/9359
dc.identifier.urn	URN:NBN:no-uit_munin_8918
dc.language.iso	eng	en_US
dc.publisher	UiT Norges arktiske universitet	en_US
dc.publisher	UiT The Arctic University of Norway	en_US
dc.rights.accessRights	openAccess
dc.rights.holder	Copyright 2016 The Author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/3.0	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)	en_US
dc.subject.courseID	INF-3990
dc.subject	VDP::Technology: 500::Information and communication technology: 550::Computer technology: 551	en_US
dc.subject	VDP::Teknologi: 500::Informasjons- og kommunikasjonsteknologi: 550::Datateknologi: 551	en_US
dc.title	AutoMon. Automatic monitoring and problem detection for distributed systems	en_US
dc.type	Master thesis	en_US
dc.type	Mastergradsoppgave	en_US