Fixing a Corrupted Apache Log File

The Apache access.log file is a good place to look for evidence of hacking activity. Code injections, brute force attacks and excessive crawling all show up in there, along with legitimate hits. While searching recently, I was surprised to see that Linux had started to regard the file as binary data:

$ grep something access.log
Binary file access.log matches

Check the File Format

Rather then print matching lines, grep just informs us that the file is “binary”, and contains a match. But Apache log files are usually text, like the previous log, here::

$ file access.log access.log.1
access.log: data
access.log.1: ASCII text, with very long lines

What is causing the misinterpretation? Could it be some odd code injection, or other untoward content? Well, one of the things that “file” uses to make its mind up about a file is the presence of control (or non printing) characters. Let’s have a closer look.

Find the Control Characters

Finding control characters isn’t easy. By their nature, they are difficult to grep. We can’t type them directly. We don’t even know which ones to look for. And most Unix tools reject the file as “binary” anyway, and won’t print matching lines.

Cat -v is a very useful command in these situations. It prints the file, showing non-printing characters, which it presents as strings starting with “^”. Very simple, and it seems to catch everything. The access.log file is 30,000 lines long, so let’s cat -v the file and compare with the original.

$ cat -v access.log > /tmp/mylog
$ diff access.log /tmp/mylog
3100c3100
< 141.8.142.73 - - [29/Apr/2019:13:01:04 +0100] "GET /robots.txt HTTP/1.1" 200 386 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@141.8.142.73 - - [29/Apr/2019:13:01:04 +0100] "GET /robots.txt HTTP/1.1" 200 386 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

And there it is. The file has a load of “control-at” characters inserted at the start of line 3100.

Fix the File

Fix the file by simply removing line 3100. Proceed as follows

$ sed -n '1,3099p' /tmp/mylog > /tmp/mylog.fixed
$ sed -n '3101,$p' /tmp/mylog >> /tmp/mylog.fixed

The first sed command captures lines 1 to 3099 and the second gets lines 3101 to the end of the file. In other words, mylog.fixed is the same as the original file, minus line 3100.

Verify the File

The sizes of the files look right, and a diff shows that the only difference between them is the corrupted line:

$ ls -l /tmp/mylog /tmp/mylog.fixed 
-rw-r--r-- 1 fred fred 8508691 May 3 09:13 /tmp/mylog
-rw-r--r-- 1 fred fred 8508061 May 3 09:17 /tmp/mylog.fixed
$ diff /tmp/mylog /tmp/mylog.fixed
3100d3099
< ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@141.8.142.73 - - [29/Apr/2019:13:01:04 +0100] "GET /robots.txt HTTP/1.1" 200 386 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

Install Repaired File

The repaired file could now be moved into place. However, this is quite a busy web server and the log is being written to constantly. In order to avoid problems with the writing process (an apache2 daemon), Apache will be stopped briefly.

Another issue is that during the few minutes it took to carry out the above procedures and confirm the issue, many new lines have been written to the live log. We don’t want to lose those. So the file will be repaired as follows. First, check that the error is still there (and hasn’t been moved by logrotate), then stop Apache, fix the file, and re-start:

$ sed -n '3100p' access.log | cat -v
@^@^@^@^@^@^@^@^@^@^@141.8.142.73 - - [29/Apr/2019:13:01:04 +0100] "GET /robots.txt HTTP/1.1" 200 386 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

The error is still there, at line 3100 (many ^@ characters have been removed, for clarity). Using sed again, we can repair the live file, using the same technique discussed above. Apache is then stopped while the log is replaced. NB. the following commands are executed as root.

# sed -n '1,3099p' access.log > /tmp/mylog.fixed
# sed -n '3101,$p' access.log >> /tmp/mylog.fixed
# service apache2 stop
# cp /tmp/mylog.fixed access.log
# service apache2 start

Verify the Repaired File

Cat -v can again be used to verify that the corrupted line has been removed:

$ cat -v access.log > /tmp/newlog
$ diff access.log /tmp/newlog
(no output, except for a few lines appended to the live file)

Conclusion

It is possible to find non-printing characters in other ways. In hindsight, now that we know what we are looking for, we could have detected the problem with grep -a, which treats all input files as text. For example:

$ grep -a '\^@' /tmp/mylog
^@^@^@^@^@^@^@^@^@ (...etc)
$ grep -a '\^' /tmp/mylog
^@^@^@^@^@^@^@^@^@ (...etc)

A “grep -a -v” could then have been used to remove the problematic line. The drawback with solutions involving grep, (or sed, perl) in combination with fancy search patterns is that if they return nothing, you are never sure whether there is nothing to be found, your regular expression was wrong, the escape sequence failed or the command just operates a different way.

Cat -v, on the other hand, is quick, simple, and very reliable.

Root Cause

Incidentally, the error was caused by a power cut that happened a few minutes before 29/Apr/2019:13:01:04, the timestamp shown on the corrupted log line.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.