Clustering the messages

I am continuing my story about the tasks I faced while implementing the oVirt log analyzer.

What should the user see?

One of the most important design questions for the analyzer is how to classify the level of "importance" of messages. Even if you want to analyze logs covering a time range of only a few minutes, they might contain thousands of lines. The user should see only useful and important messages, while the others should be filtered out. The first idea is to use keywords in messages, for example "error", meaning that the message could be important because something went wrong. But what if you want to understand the reasons that caused this error? They could be hidden in chains of seemingly ordinary messages.

How does it work

Different parts of the oVirt system generate many logs. Each of them contains detailed information about its own component and also information related to the other parts it communicates with. The ideal way to analyze the log files would be to reconstruct this communication scheme from all the logs. I tried to do this before the start of the internship in a very simple way (linking errors only if they were followed by an increased number of errors). Of course there are too many difficulties in doing this properly, so after implementing the parsing I decided to test another method.

Message’s features

Each message in the log file may be more or less useful depending on the task you are solving. For example, if you are searching for the reason of some error, you might want to examine all previous warnings, and if you want to make sure that the configuration of your system is correct, you need to see the messages describing its structure. In theory, every message can be marked and classified according to its meaning. I suggested the following metrics (a small illustrative sketch follows the list):

  • Containing «danger» words like error, warning, fail, etc.
  • Containing user-defined words, like some ID or a part of a message.
  • Whether the number of errors or warnings increased after the current message compared to the previous ones within a time window.
  • Some other important identifiers constructed earlier, for example flags marking long operations or tasks.
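
Just to illustrate the idea (this is not the analyzer's actual code, and the danger-word list and the user_keywords parameter are assumptions), a message could be tagged with such features roughly like this:

import re

DANGER_WORDS = re.compile(r'\b(error|warning|fail(?:ed|ure)?)\b', re.IGNORECASE)

def message_features(text, user_keywords=()):
    # Collect simple boolean features for a single log message.
    return {
        'has_danger_word': bool(DANGER_WORDS.search(text)),
        'has_user_keyword': any(k.lower() in text.lower() for k in user_keywords),
    }

print(message_features('Task failed: VM 42 is not responding', user_keywords=['VM 42']))
# {'has_danger_word': True, 'has_user_keyword': True}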

I’d also like to share how I solved another important task. It is the first step of recognizing the content of a message: we need to cluster all messages into groups so that the messages within each group are similar. This stage is important because log files are generated by algorithms and basically consist of pre-defined messages with small differences.

Clustering messages

The first simple step to unify similar messages was to remove their varying parts. A typical pattern in such messages is that the unique information is placed in brackets or quotes. A simple general illustration:

New message "Hello" was received (Anastasia)
New message "Good bye" was received (Victoria)

After this, most unique parts are filtered out, but messages usually have a more complex structure, for example nested brackets. So I decided to compare the content of all messages with each other and construct a distance matrix.

Distance between messages

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. (Source)

def distance(a, b):
    """Calculates the Levenshtein distance between a and b."""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n, m)) space
        a, b = b, a
        n, m = m, n

    current_row = list(range(n + 1))  # keep only the current and previous rows, not the entire matrix
    for i in range(1, m + 1):
        previous_row, current_row = current_row, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete, change = previous_row[j] + 1, current_row[j - 1] + 1, previous_row[j - 1]
            if a[j - 1] != b[i - 1]:
                change += 1
            current_row[j] = min(add, delete, change)

    return current_row[n]

I tried the implementation from the editdistance package. It works much faster than the others I tried, but is still quite slow, especially for long strings. I then limited the string length, and the speed became acceptable for our task.

OK, now we have a matrix where small numbers represent a small number of symbols that need to be changed or removed to make two strings equal. Of course this value strongly depends on the string length, so the result was normalized by the full length of the string. The next step is to create groups of nearest neighbors from the known distances. The difficulty is that we don't know in advance how many clusters there will be. So I chose an algorithm from scikit-learn that works without knowing the number of clusters.
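
A minimal sketch of how such a normalized matrix can be built with the editdistance package (the 100-character limit here is just an illustrative value, not the analyzer's setting):

import editdistance  # pip install editdistance
import numpy as np

def distance_matrix(messages, max_len=100):
    # Truncate long strings so the computation stays fast.
    msgs = [m[:max_len] for m in messages]
    n = len(msgs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = editdistance.eval(msgs[i], msgs[j])
            # Normalize by the longer length: 0 means identical,
            # values near 1 mean almost completely different strings.
            dist[i, j] = dist[j, i] = d / max(len(msgs[i]), len(msgs[j]), 1)
    return dist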

Clustering

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density. (Source)

It may be more demanding on the data than, for example, KNN, but it works very fast and gives an informative result. Messages that didn't get into any cluster receive the label -1 (the algorithm considers them noise), but in our interpretation they simply have unique content.
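
A minimal sketch of running DBSCAN on such a precomputed matrix (the eps and min_samples values are illustrative, not the analyzer's settings):

from sklearn.cluster import DBSCAN
import numpy as np

# dist: a normalized distance matrix, e.g. built as in the sketch above
dist = np.array([[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])

labels = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit_predict(dist)
print(labels)  # e.g. [0, 0, -1]: the last message is "noise", i.e. unique content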

The problem is that the Levenshtein distance does not give any preference to the order of words. This means that messages with a similar tail but a different beginning may end up in one cluster. In our task this situation is not typical: usually the informative part of a message is located at the beginning, and the long unique indicators are placed at the end. For example, if you are debugging your program and need to check the value of some variable, most likely you will write print('my_count =', count). Thus, the distance between messages with different beginnings was penalized, and vice versa, the distance between messages with similar beginnings was reduced.

Make it simpler

In the end I needed to make the algorithm faster, so I tried a much simpler way of clustering and kept it because its accuracy turned out to be acceptable: each message is placed into a cluster named by its first two words. This works fine for our task, as sketched below.
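
A minimal sketch of this simplified grouping (assuming the unique bracketed/quoted parts have already been stripped from the messages):

from collections import defaultdict

def cluster_by_prefix(messages, n_words=2):
    # The first two words of a message act as its cluster name.
    clusters = defaultdict(list)
    for msg in messages:
        key = ' '.join(msg.split()[:n_words])
        clusters[key].append(msg)
    return clusters

clusters = cluster_by_prefix(['New message was received',
                              'New message was sent',
                              'Task finished successfully'])
# {'New message': [two messages], 'Task finished': [one message]}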

Conclusion

Implementing message clustering allowed me to add the following features for messages:

  • Frequency (size of the cluster)
  • Whether the messages in a cluster differ only by particular words, for example IDs
  • The coverage of the message across the whole log file and timeline

At the moment I am refining the details of the analyzer's implementation and testing these features. The next post will most likely be about creating the interface. Thanks for reading :)

First speech in English

Hi everyone!

On the 18th of July I was asked to give a talk about our research group’s key scientific project, and I had to do it almost without preparation. This was my first real experience of speaking in English, not counting the talk at the “Falling Walls Lab”, because by its rules that speech lasted only three minutes and I had plenty of time to prepare for it.

If you are interested in improving stereoscopic video quality, watch the video of my speech:

Parallelizing the parsing

When processing large log files, I faced the problem of insufficient speed of my method. There are many places that could be improved, but I decided to start with parallelizing it and to analyze the algorithm in detail afterwards.

Since the parser reads a log file line by line, the first idea was to parallelize it by lines. However, as I have already described in the previous post, there are multi-line messages. If we split the file into groups of lines, we may cut through such messages and they won't be processed correctly. So for a start I decided to parallelize the parsing by files.

Multiprocessing tool

from multiprocessing import Manager, Pool

The multiprocessing module allows you to spawn processes in the same manner as you spawn threads with the threading module. The package offers a similar API, and its Pool class can be used to parallelize the execution of a function across multiple inputs. I'd like to show how simple this is.

Pool

with Pool(processes=5) as pool:
    worker = pool.imap(func, run_args)
    for res in worker:
        print(res)

Here I decided to parallelize the execution across 5 worker processes, and the results are returned through the iterator worker. imap yields the returned values in their original order. I could have stopped at this stage, but there was a problem. I used a file descriptor for saving the parser's messages; it could be the stdout or stderr stream, or any text file. Such a descriptor can't simply be shared between processes, so I had to find a solution.

Queue

There are many ways to organize communication between processes, and I chose the Manager class from the multiprocessing library. It controls a server process which holds Python objects and allows other processes to manipulate them using proxies. A manager returned by Manager() supports many types, and one of them is Queue. The worker processes write to the queue, and then I collect all the written messages and can write them to the required file.

m = Manager()
q = m.Queue()         # a queue shared with the workers through the manager
run_args = [[q, i, filenames, path] for i in idxs]
#...
while not q.empty():  # drain the messages written by the workers
    warn = q.get()
    print(warn)
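
For completeness, here is a rough sketch of the worker side (the function name and the warning text are hypothetical; the real parser reports different messages):

def parse_one_file(args):
    # The queue proxy is passed to the worker as an ordinary argument.
    queue, i, filenames, path = args
    # ... parsing happens here ...
    queue.put('Warning: some lines of {} did not match the format'.format(filenames[i]))
    return i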

Progressbars

I decided to add a progress bar for the parsing process. I used the progressbar2 library, which can be installed via pip install progressbar2.

import progressbar
from progressbar import ProgressBar

This library is also easy to use, and I'd like to show an example of creating the progress bar for my task.

Progress bar for parsing

widget_style = ['All: ', progressbar.Percentage(), ' (',
                progressbar.SimpleProgress(), ')', ' ',
                progressbar.Bar(), ' ', progressbar.Timer(), ' ',
                progressbar.AdaptiveETA()]
bar = ProgressBar(widgets=widget_style)
with Pool(processes=5) as pool:
    worker = pool.imap(star, run_args)  # star is my wrapper around the parsing function
    for _ in bar(run_args):             # the main bar advances once per finished file
        result += [worker.next()]

Progress bar for sub-processes

There can be situations when one of the log files is much bigger than the others, and it is useful to know how much of each file has already been parsed. The idea was to add a progress bar to each process and also keep a main progress bar. Since one log message is not always located on a single line, I decided to measure the progress in symbols, counting them with file.seek(0, os.SEEK_END), as in the small sketch below. Part of my solution follows after it; I used a curses interface to display the progress bars in the console.
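
A small sketch of the symbol counting itself (assuming the file is opened in text mode; the file name is just an example):

import os

with open('example.log') as f:
    total = f.seek(0, os.SEEK_END)  # roughly the total size of the file in symbols
    f.seek(0)                       # rewind before parsing
    parsed = 0
    for line in f:
        parsed += len(line)         # progress is roughly parsed / total
        # ... parse the line and update the progress bar here ...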

# creating progress bars for subprocesses
def runner_parallel(inp):
    function, args, name, queue, order_idx = inp
    # worker index, derived from the pool process name (e.g. "ForkPoolWorker-3")
    idx = int(multiprocessing.current_process().name.split("-")[1])
    custom_text = progressbar.FormatCustomText(
        '{} - %(type_op)s: '.format(name),
        dict(type_op="Start"))
    widget_style = [custom_text, progressbar.Percentage(), ' (',
                    progressbar.SimpleProgress(), ')', ' ',
                    progressbar.Bar(), ' ', progressbar.Timer(), ' ',
                    progressbar.AdaptiveETA()]
    # each subprocess gets its own bar writing to the shared queue
    # (Writer is a small file-like helper, sketched after this listing)
    args += [ProgressBar(widgets=widget_style,
                         fd=Writer((0, idx - 1), queue)),
             custom_text]
    return (function(*args), order_idx)

# ...in the main function
manager = multiprocessing.Manager()
the_queue = manager.Queue()
result = []
widget_style = ['All: ', progressbar.Percentage(), ' (',
                progressbar.SimpleProgress(), ')', ' ',
                progressbar.Bar(), ' ', progressbar.Timer(), ' ',
                progressbar.AdaptiveETA()]
run_args = [(func, args, name, the_queue, order_idx)
            for order_idx, (func, name, args) in enumerate(run_args)]
main_pb = ProgressBar(widgets=widget_style, fd=Writer((0, 0)),
                      max_value=len(run_args))
with Pool(processes=5) as pool:
    workers = pool.imap_unordered(runner_parallel, run_args)
    idx = 0
    main_pb.start()
    while True:
        try:
            try:
                # collect every result that is already available
                while True:
                    result.append(workers.next(0))
                    idx += 1
                    main_pb.update(idx)
            except multiprocessing.TimeoutError:
                pass
            # redraw the per-process progress bars from the queued updates
            while not the_queue.empty():
                pos, text = the_queue.get()
                # print the progress at position pos (via curses)
            time.sleep(1)
        except StopIteration:
            break
result = sorted(result, key=lambda x: x[1])
return [r[0] for r in result]
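
The Writer objects above are small file-like helpers that I pass to ProgressBar instead of a real stream; their implementation is not shown here, but a minimal sketch of the idea (an assumption, not my exact code) could look like this:

class Writer:
    """A file-like object that forwards progress bar output to the shared
    queue together with the screen position where it should be drawn."""
    def __init__(self, position, queue=None):
        self.position = position
        self.queue = queue

    def write(self, text):
        if self.queue is not None:
            self.queue.put((self.position, text))
        # in the main process (no queue) the text would be drawn directly with curses

    def flush(self):
        pass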

And here is how it looks: ProgressPool

Conclusion

Using multiprocessing helped me decrease the time of parsing log files, even when parallelizing only by files. I had previously implemented parallelization only in the “Operating systems” course. The task was to write a Unix shell in C. Some commands, for example dir and tail, we had to write ourselves; others were called from our shell. The most difficult part was implementing the pipeline (the | symbol), where processes were created and executed without waiting for their completion. I will never forget the WNOHANG flag of the waitpid call, and how our teacher asked us about the details and the purpose of this flag for tracking zombie processes. It was interesting and useful to learn the principles of parallel processes and the communication between them, as well as signal redefinition and the use of semaphores, messages and shared memory. But how nice it is now to use a Pool and do all these complex actions in one line! :)

Parsing text files with different formats

The problem

The first stage of my work was to study how to read and recognize the content of different logs. I faced the situation where each part of the system has its own log format, and different versions may have slightly different formats:

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
...
Info::2010-10-10 12:10:10,000::(Some index)::[Some information] Message text (Some other code)
...

I had several ideas about how to solve this task, and I'd like to share the results. We discussed two options for recognizing different formats:

  • read a file line by line and collect the repeated information in the lines, which represents the format. The separators, their number and their order would make up the template for a log file line

  • define a template manually for each format. This method requires the user to add a new template if a new format has to be analyzed, but the advantage is the time saved by not having to pre-read each log to construct the format template.

To avoid unexpected difficulties I implemented the second approach and kept the option of adding automatic template building later.

Simple solution

I used regular expressions to define the file templates. The templates contain named fields with their allowed characters and separators, and I arrived at all these constraints after testing the templates on the different log files I have. At this stage the online regular expression editor Pythex helped me a lot. Once I had templates that each matched only "their own" format, I placed them in order from the most detailed (with many constraints) to the most generic. This task would have taken me only a couple of days, if there were no other difficulties.
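
A minimal sketch of such a template for the first of the formats shown above, with named fields (the field names and character classes here are illustrative; the real templates are more constrained):

import re

template = re.compile(
    r'^(?P<date_time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) '
    r'(?P<level>[A-Za-z]+) '
    r'\[(?P<info>[^\]]*)\] '
    r'(?P<message>.*)$'
)

line = '2010-10-10 12:10:10,000 Warning [Some information] Message text'
match = template.match(line)
if match is not None:
    print(match.group('date_time'), match.group('level'), match.group('message'))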

Facing difficulties

Since I analyze log files, I could expect that each message would match the format's template (it doesn't). Some lines suddenly do not match the template. We want to save as much information as we can, so it is better to parse them somehow or at least save the raw line. Here are some examples of when a line differs from the log file format:

Disappearing date and time

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
Warning: some warning text
2010-10-10 12:10:10,500 Info [Some information] Message text 
...

I found out that not all log file lines include a date+time label, for example some system messages that can appear at any time between the previous and the following log messages. The idea for processing such cases is to copy the date and time from the previous message and to note this in the text of the parsed line. So if the date and time is not detected in a log line, the line is saved in raw format and the required fields are copied from the previous message. But even if the date+time is found, it doesn't mean that the line will be matched.

Date and time missing from their usual place but appearing elsewhere in the line

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
Warning: some warning text about 2010-05-24 18:30:16,400 date and time action
2010-10-10 12:10:10,500 Info [Some information] Message text 
...

We may find a date+time in a log line, but it is not located in the place defined for it and just represents some other date+time mentioned in the message, unrelated to the time the line was written to the log file. All we can do is try to match the full line against the file format template and handle the failure, which tells us that the date+time is in the wrong place. Processing this case also involves copying information from the previous message.

One of the line fields includes \n

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
and it's continuation
and more continuation
and ending

2010-10-10 12:10:10,500 Info [Some information] Message text 
...

The simplest case is when the last field of the line becomes multiline (a common example is a Traceback). Then the line will still match the template, because at this step we don't know anything about the next line and its relation to the current one. So if we read a line, parse it, and then find out that the next line does not match the template, we should concatenate it with the previous line, and keep doing so until we read a "normal" matching line. This means that we have to store the line until we are sure that we've read the full message, as in the sketch below.
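
A minimal sketch of this buffering logic (simplified: it ignores the date+time exceptions described above, and lines before the first matching line are skipped):

def join_multiline(lines, template):
    current = None
    for line in lines:
        if template.match(line):
            if current is not None:
                yield current            # the previous message is now complete
            current = line
        elif current is not None:
            current += '\n' + line       # continuation of the previous message
    if current is not None:
        yield current                    # flush the last buffered message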

The more complex case is when a non-final field becomes multiline and pushes the template-required fields of the message onto another line.

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
and it's continuation
and (Some other code)

2010-10-10 12:10:10,500 Info [Some information] Message text (Some other code)
...

What do we see in this situation? The first line contains the date+time but does not match the template. This looks like the second case above, but it is actually a different situation that can be solved. The idea is again to concatenate the lines until we reach a "normal" line, and then try to match the assembled line against the template. If fields were pushed to other lines, they will now be found, and if it turns out to be the disappeared date+time case, we follow the instructions from that case.

Combination of several exceptions

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
and it's continuation
and more continuation about 2010-05-24 18:30:16,400 date and time action
and ending
and (Some other code)

Warning: some warning message
2010-10-10 12:10:10,500 Info [Some information] Message text (Some other code)
...

Of course, while solving all the described non-standard situations I also looked for their combinations and tested the implemented algorithm on them. One feature of the parsing algorithm is the constraints that the user can impose on it. For example, the user can provide a date and time range in which the log messages should be shown, or define a particular keyword that must be part of a message. So the analyzer does not have to process and store all lines, only those that satisfy the conditions. And we have to check the conditions only after we have parsed the whole message, because a date+time found in the middle of it may not relate to when the message appeared, and the keywords may be located on a line other than the first if the message contains \n.
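
A minimal sketch of such a check, run only after the whole (possibly multiline) message has been assembled (the parameter names are hypothetical):

def message_passes_filters(msg_time, full_message, time_range=None, keywords=()):
    if time_range is not None:
        start, end = time_range
        if not (start <= msg_time <= end):
            return False
    # keywords may be hidden on a continuation line, which is why we check
    # the full assembled message and not just its first line
    if keywords and not any(k in full_message for k in keywords):
        return False
    return True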

Conclusion

The process of solving this problem was interesting and unusual for me. In scientific work and studies you usually have a small amount of standardized data on top of which you need to write a lot of different things. But in this case you have 100 lines of algorithm code that you keep improving and modifying, because every time a new case appears which must be taken into account. The next stage promises to be more familiar to me, and I will do my best and hope to meet the expectations for the result.

Choosing type of interface in a new project

My previous experience

During my studies and scientific work I had not faced the task of creating a user-friendly and functional graphical interface. Homework is usually about creating an effective algorithm, and of course we do it based primarily on the basic requirements. We submit our solutions, pass the exam, and the written code is unlikely ever to be used again, so the most I had written for my programs was a command line interface (reading arguments and processing flags).

In scientific work, research and the search for a new solution to the problem also take priority over creating a useful interface. So when I faced the need to choose the best interface for the task I am solving, I surveyed this area. We considered the following types:

  • Command line interfaces
    • Argparse
    • Curses
  • Graphic interfaces
    • PyQT
    • Web interface

With the help of my mentor I prepared a list of pros and cons of each type of interface with regard to our task.

Command line interfaces

  • Argparse
import argparse

parser = argparse.ArgumentParser(
    description='This is an example argparser')
parser.add_argument('dir',
                    metavar='directory',
                    type=str,
                    nargs=1,
                    help='Path to a directory with file(s)')
parser.add_argument('filename',
                    metavar='filename(s)',
                    type=str,
                    nargs='+',
                    help='Filenames (space-separated)')
parser.add_argument('-out', '--outfile',
                    type=str,
                    help='Filename to save the output (optional)')

I use it now, during development, and for that it is quite convenient. But in the last stage of my internship I would like to provide a useful and user-friendly tool and interface, so I took a look at other types.

  • Curses (example of a curses interface)
    • (+) curses will be more convenient to run on remote servers without X11
    • (+) it will be easier to implement
    • (-) perhaps it will be convenient only for advanced users
    • (-) you will need to use some graphical interface if you want to visualise the data, but probably this is not so critical

There are also alternatives to Python curses; for example, I found Urwid.
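
For comparison with the argparse example above, here is a minimal curses sketch (not analyzer code, just an illustration of what this style of interface looks like):

import curses

def main(screen):
    # draw a line of text at row 0, column 0 and wait for a key press
    screen.addstr(0, 0, 'oVirt log analyzer: press any key to exit')
    screen.refresh()
    screen.getch()

curses.wrapper(main)  # sets up the terminal and restores it afterwards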

Graphic interfaces

  • PyQT
    • (+) cross-platform
    • (+) easy to use
    • (-) need to install pyqt package
    • (-) may be slow over remote connections
  • Web interface
    • (+) even more cross-platform, as you can use it even from a tablet
    • (+) no need to install extra packages
    • (+) I could add interactive charts (I’ve used Highcharts, but there are many open-source libraries, for example D3: Data-Driven Documents)
    • (-) it can be more difficult to implement, but there are many frameworks; for example, I’ve tried the examples from Flask (http://flask.pocoo.org)
    • (-) a running web application must be accessed somehow, typically over ssh port forwarding, which is a bit of a hassle

The web interface seems more attractive to me; however, its implementation can take a long time. Since I would like to devote more time to developing the algorithm, we decided to choose the curses/urwid interface. At the same time I'll keep the possibility of changing the interface if necessary, so if there is enough time, I will try to create a web interface.

I hope my summary will help you get an initial idea of which interface is best for you, but do not forget that the choice depends on the task you are solving and the time available to you.