First speech in english

Hi everyone!

On 18th July I was asked to tell about our research group’s key scientific project, and I had to do this almost without preparing. This was my first experience of speaking in English, not counting the speech on the “Falling Walls Lab”, because by the rules the speech was to last only three minutes and I had a lot of time to prepare for it.

If you are intrested in improving stereoscopic video quality, watch the video of my speech:

Parallelizing the parsing

When processing large log files, I faced the problem of the insufficient speed of my method. There are many places to be improved, but I decided to start with the parallelizing it and then analyze the algorithm in detail.

As the parser reads log file line-by-line, the first idea was to parallelize it by lines. However, as I have already described in the previous post, there are multi-line-lines. If we split the file in groups of lines, we may divide such lines and they won’t be processed correctly. So, for start I decided to parallelize the parsing by the files.

Multiprocessing tool

from multiprocessing import Manager, Pool

The multiprocessing module allows to spawn processes in the same manner than you can spawn threads with the threading module. The multiprocessing package includes API, and there is a Pool class that can be used to parallelize executing a function across multiple inputs. I’d like to show how simple is to do this.

Pool

with Pool(processes=5) as pool:
    worker = pool.imap(func, run_args)
    for res in worker:
    	print(res)

Here I decided to parallelize the executing in 5 threads, and the result in returning to the variable worker. The returned values are sorted in their original order. I could stop on this stage, but there was a problem. I used a file descriptor for saving the parser’s messages, it could be stdout or stderr streams, or any text file. The descriptor can’t be shared between the processes, and I’ve found the solution.

Queue

There are many ways to organize communication between processes, and I choose the existing in the multiprocessing library class Manager. It controls a server process which holds Python objects and allows other processes to manipulate them using proxies. A manager returned by Manager() supports many types, and one of them is Queue. Processes write to the queue and then I get all the written messages and can put them to the needed file.

m = Manager()
q = m.Queue()
run_args = [[q, i, filenames, path] for i in idxs]
#...
while not q.empty():
    warn = q.get()
    print(warn)

Progressbars

I decided to add a progress bar for the parsing process. I used a Progressbar library, which can be installed via pip install progressbar2.

import progressbar

This library is also easy in use, and I’d like to show the example of creating the progress bar in my task.

Progress bar for parsing

widget_style = ['All: ', progressbar.Percentage(), ' (',
                         progressbar.SimpleProgress(), ')', ' ',
                         progressbar.Bar(), ' ', progressbar.Timer(), ' ',
                         progressbar.AdaptiveETA()]
bar = ProgressBar(widgets=widget_style)
with Pool(processes=5) as pool:
    worker = pool.imap(star, run_args)
    for _ in bar(run_args):
        result += [worker.next()]

Progress bar for sub-processes

There can be situations when one of the log files is bigger then the others, and it is useful to know how many lines are already parsed in each file. The idea was to add a progressbar to each process, and also keep a main progress bar. Since one log message is not always located in the one line, I decided to measure the progress in symbols, counting them with file.seek(0, os.SEEK_END). The part of the solution is presented below, and I used a curses interface to display the progress bars in console.

# creating progressbars for subprocesses
def runner_parallel(inp):
    function, args, name, queue, order_idx = inp
    idx = int(multiprocessing.current_process().name.split("-")[1])
    custom_text = progressbar.FormatCustomText(
                            '{} - %(type_op)s: '.format(name),
                            dict(type_op="Start"))
    widget_style = [custom_text, progressbar.Percentage(), ' (',
                    progressbar.SimpleProgress(), ')', ' ',
                    progressbar.Bar(), ' ', progressbar.Timer(), ' ',
                    progressbar.AdaptiveETA()]
    args += [ProgressBar(widgets=widget_style, fd=Writer((0, idx - 1),
                         queue)), custom_text]
    return (function(*args), order_idx)

# ...in the main
manager = multiprocessing.Manager()
the_queue = manager.Queue()
result = []
widget_style = ['All: ', progressbar.Percentage(), ' (',
                progressbar.SimpleProgress(), ')', ' ',
                progressbar.Bar(), ' ', progressbar.Timer(), ' ',
                progressbar.AdaptiveETA()]
run_args = [(func, args, name, the_queue, order_idx)
            for order_idx, (func, name, args) in enumerate(run_args)]
main_pb = ProgressBar(widgets=widget_style, fd=Writer((0, 0)),
					  max_value=len(run_args))
with Pool(processes=5) as pool:
    workers = pool.imap_unordered(runner_parallel, run_args)
    idx = 0
    main_pb.start()
    while True:
        try:
            try:
                while True:
                    result.append(workers.next(0))
                    idx += 1
                    main_pb.update(idx)
            except multiprocessing.TimeoutError:
                pass
            while not the_queue.empty():
                pos, text = the_queue.get()
   				# print the progress
            time.sleep(1)
        except StopIteration:
            break
result = sorted(result, key=lambda x: x[1])
return [r[0] for r in result]

And here is how it looks: ProgressPool

Conclusion

The use of multiprocessing helps me to decrease the time of parsing log files, even with parallelizing by files. I previously implemented parallelization only in the course “Operating systems”. The task was to write Unix shell using C language. Some commands, for example dir, tale we had to write ourselves, others were calling from the our shell. The most difficult part was the creation of a conveyor (| symbol), then the processes were created and executed without waiting for their completion. I will never forget the WNOHANG flag in the waitpid command, and how our teacher was asking us about the details and definition of this flag for tracking zombie processes. It was interesting and useful to learn the principle of parallel processes and communication between them, as well as signal redefinition and the use of semaphores, messages and shared memory. But now how nice is to use a Pool and do all this complex actions in one line! :)

Parsing text files with different formats

The problem

The first stage of my work was to study how to read and recognize the content of different logs. I faced the situation when each part of the system has its own log format, and different versions may have slightly different formats:

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
...
Info::2010-10-10 12:10:10,000::(Some index)::[Some information] Message text (Some other code)
...

I had several ideas about solving this task, and I’d like to share the results. We discussed two opportunities for recognizing different formats:

  • read a file line-by-line and collect the repeated information in the lines, that represents the format. Separators, their number and order would make the template for a log file line

  • define a template manually for each format. This method assumes the user to add a new template if a new format appears to be analyzed, but the advantage is in saving time with no need to pre-read each log to construct the format template.

To avoid unexpected difficulties I’ve implemented the second approach and left the possibility to create an automatic template-building.

Simple solution

I used regular expression to define the file template. There are named fields with possible characters and separators in the templates, and all these constraints I’ve received after testing them on different log files that I have. On this stage on-line regular expression editor Pythex helped me a lot. After I have received a templates that matches only “their” formats, I’ve placed them in order from the most detailed (with many constrains) to the most common. I would spend a couple of days on this task, if there were no other difficulties.

Facing difficulties

Since I analyze the log files, I can expect that each message would match the template of the format (no). So some lines suddenly does not match the template. We want to save as much information as we can, so it is better to parse them somehow of at least save the raw line. I list some of the examples when a line differs from the log file format:

Disappering of date and time

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
Warning: some warning text
2010-10-10 12:10:10,500 Info [Some information] Message text 
...

I found out that not all log file lines include a date+time label. For example, some system messages, that could appear in any time between the previous and following log messages. The idea of processing such cases is in copying the date and time from the previous message and noticing it in the text of the parsed line. So if the date and tine was not detected in the log line, the line is sawed in a raw format and required fields are copied from the previous message. But even if the date+time is found it doesn’t mean that the line will be matched.

Disappering of date and time and it’s appearing on the other place in the line

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
Warning: some warning text about 2010-05-24 18:30:16,400 date and time action
2010-10-10 12:10:10,500 Info [Some information] Message text 
...

We can found a date+time in a log line, but it is not located in the defined place for date times and just represents some other date+time in the message that is not related to the time of the line was written to the log file. All we can do is to try to match full line with the file format template and handle the exception, which would tell us about the wrong location of the date+time. The processing of this case also involves copying information from the previous message.

One of the line fields includes \n

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
and it's continuation
and more continuation
and ending

2010-10-10 12:10:10,500 Info [Some information] Message text 
...

The most simple case is when the last line field become miltiline (common example – Tracebacks). Then the line will match the template as on this step we don’t know anything about the next line and it’s relation to the current line. So if we read the line and parse it, and then found out that the next line does not match the template, we should concatenate it with the previous line, and do it until we read the “normal” matched line. This means that we should store the line until we make sure that we’ve read the full message.

The more complex case is when the non-last field become multiline and moves the template-required fields of the message to another line.

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
and it's continuation
and (Some other code)

2010-10-10 12:10:10,500 Info [Some information] Message text (Some other code)
...

What do we see in this situation? The first line conclude the date+time, but does not match the template – the explanation coincides the second case, but represents the other situation that can be solved. The idea is again to concatenate the lines until we reach the “normal” line, and then try to match the constructed line to the template. If there were lost fields, they will be found, and if the situation was about disappeared date+time, we will do the instructions from that case.

Combination of several exceptions

...
2010-10-10 12:10:10,000 Warning [Some information] Message text 
and it's continuation
and more continuation about 2010-05-24 18:30:16,400 date and time action
and ending
and (Some other code)

Warning: some warning message
2010-10-10 12:10:10,500 Info [Some information] Message text (Some other code)
...

Of course during solving all the described non-standard situations I have been searching for their combinations and tested the implemented algorithm for them. One of the features of the parsing algorithm is in the constrains that the user could impose to it. For example, he or she could provide a date and time range in which the log messages are to be showed or define a particular keyword that should be a part of a message. So the analyzer do not have to process and store all rows, only those that correspond to conditions. And we have to check the conditions after we parse the whole message, because the found date+time could not relate to the message appearance, and the keywords could be located on the non-first line of the message, if it contains \n.

Conclusion

The process of solving this problem was interesting and unusual for me. During scientific work and studies it turns out that you have a tiny bit of standardized data over which you need to write a lot of different things. But in is the case you have 100 lines of algorithm and you improve and modify them because every time there appears a new case, Which must be taken into account. The next stage promises to be more familiar to me and I will do my best and hope to meet expectations for the result.

Choosing type of interface in a new project

My previous experience

During my studies and scientific work, I have not faced the task of creating a user-friendly and functional graphic interface. Homework is usually set to create an effective algorithm. And, of course, we do them primarily based on the basic requirements. We sent our solutions and pass the exam, and the written code is unlikely to be used ever - so the most I have wrote for my programs is the command line interface (getting arguments and processing flags).

In scientific work research and searching a new solution for the problem also predominates under creating useful interface. So when I faced with need to choose the best interface for the task I am solving, I went through this area. We сonsidered the following types:

  • Command line interfaces
    • Argparser
    • Curses
  • Graphic interfaces
    • PyQT
    • Web interface

With help of my mentor I have prepared a list of pros and cons of each type of interface regarding our task.

Command line interfaces

  • Argparser
import argparse

parser = argparse.ArgumentParser(
        description='This is an exaple argparser')
    parser.add_argument('dir',
                        metavar='directory',
                        type=str,
                        nargs=1,
                        help='Path to a directory with file(s)')
    parser.add_argument('filename',
                        metavar='filename(s)',
                        type=str,
                        nargs=+,
                        help='Filenames (space-separated)')
    parser.add_argument('-out', '--outfile',
                        type=str,
                        help='Filename to save the output (optional)')

I use it now, during the developement, for which it is quite convenient. But on the last stage of my internship I could provide useful ans user-friendly tools and interface, so I took a look on other types.

  • Curses Exaple of curses interface
    • (+) curses will be more convenient to run on a remote servers without X11
    • (+) it will be easier to implement
    • (-) perhaps it will be convenient only for advanced users
    • (-) you will need to use some graphical interface if you want to visualise the data, but probably this is not so critical

There are also alternatives for python curses, for example I found Urwid

Graphic interfaces

  • PyQT
    • (+) cross-platform
    • (+) easy to use
    • (-) need to install pyqt package
    • (-) may be slow over remote connections
  • Web interface
    • (+) more cross-platform as you can use it even with the tablet
    • (+) no need to install extra packages
    • (+) I could add interactive charts (I’ve used Highcharts, but there are many open-source libraries, for example D3: Data-Driven Documents
    • (-) it can be more difficult to implement, but there are many frameworks, for example I’ve tried the examples from Flask http://flask.pocoo.org
    • (-) a running web application must be accessed some way, typically over ssh forwarding, which is some hassle

Web interface seems to be more attractive to me, however, its implementation can take a long time. Since I would like to devote more time to developing an algorithm, we decided to choose the curses/urwid interface. At the same time I’ll leave the possibility to change the interface if necessary, so if there is enough time, I would try to create a web interface.

I hope my resume will help to get an initial idea of what interface is best for you, but do not forget that the choice depends on the task that you decide and the time available to you.

Hello, World

“Do not judge me strictly, this is my first post…”

Thanks to participating in the Outreachy program, I’ve created this blog. I just want to apologize for my clumsy English, I’ll work on it, and I hope that blogging could help me. To be honest, I do not know what I will write about yet (I do not like writing about myself). I’ll try to write down something useful about my work, study and research.