Cron Apache Access Log Keep Certain Lines Remove Duplicates

I'm trying to create a nightly cron that keeps certain lines from the access_log (into a file), then removes (near) duplicates from that file, then empties the access log and restarts Apache.

I have no idea how to go about it, either the cron or the script to run. The access log is likely to be large, so I'm looking for the 'least expensive' commands to run in the script kicked off by cron.

The server is CentOS running Apache.

So with an access_log like:

11.11.11.11 [11/09/15 10:01:18] GET /file.txt
22.22.22.22 [11/09/15 10:11:18] GET /index.php
11.11.11.11 [11/09/15 10:21:18] GET /file.txt
33.33.33.33 [11/09/15 10:31:18] GET /file.txt
44.44.44.44 [11/09/15 10:41:18] GET /file.txt

Lines 1 and 3 are near duplicates, with the time being the only difference. I only want to keep one instance of each, so the output file would be:

11.11.11.11 [11/09/15 10:01:18]
33.33.33.33 [11/09/15 10:31:18]
44.44.44.44 [11/09/15 10:41:18]

Something like this?

#!/bin/bash

# Get matching lines from access_log into tmp file
cat /var/log/httpd/access_log | grep file.txt > tmp

# Remove near duplicates from tmp file
# This is where I'm having problems..
# I can't get sort, uniq or awk to work correctly
sort -buk1,1 tmp >> somefile.txt

# Remove tmp file and access_log
rm -f tmp /var/log/httpd/access_log

# Restart Apache to regenerate the access_log.
/etc/init.d/httpd restart

I suspect awk and sed may be too expensive on a large file, though I'm not sure. I'm looking for the most efficient way to end up with the example result, and I don't want to use perl or python.

It seems like the IP should be a field or column in an array to compare against in order to remove the near duplicates, but there might be a simpler way?

Would sort or uniq be correct? If so, may I have an example please?

I'll get the cron part figured out (although if you want to provide an example it would help)... The main part is removing the near duplicates.

Thanks in advance, and sorry about the poor title and example.

asked 9 November 2015 at 22:30
1 answer

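Instead of sort, you can use uniq and restrict the comparison to the first N characters with the -w option (a GNU extension, available on CentOS).

Since the IP addresses in the example are 11 characters long, the command would look like this:

uniq -w 11 tmp >> somefile.txt

Note that uniq only collapses adjacent lines, so this works only while repeated IPs follow each other directly in tmp; sort the file first if they may not. The fixed width of 11 also fits only the sample data, since real IPv4 addresses range from 7 to 15 characters.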
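
When IPs differ in length or repeated IPs are not adjacent, comparing on the first whitespace-separated field may be more robust than a fixed character width. The following is a minimal sketch, not a definitive script: the function names dedupe_by_ip and harvest_log are hypothetical, and it assumes the simplified log layout from the question (IP in field 1, date and time in fields 2 and 3). It uses a single awk pass, which reads the file once and keeps only one array entry per distinct IP, so it should not be more expensive than sorting.

```shell
#!/bin/bash
# Sketch only: function names and the assumed log layout are illustrative.

# Print "IP [date time]" for the first line seen per IP (field 1).
# seen[$1]++ evaluates to 0 the first time an IP appears, so the
# negated test is true only for that first occurrence.
dedupe_by_ip() {
    awk '!seen[$1]++ { print $1, $2, $3 }' "$@"
}

# harvest_log LOG PATTERN OUT
# Append one entry per IP, taken from lines containing PATTERN
# (matched as a fixed string, not a regex), to OUT, then empty LOG.
harvest_log() {
    local log=$1 pattern=$2 out=$3
    grep -F -- "$pattern" "$log" | dedupe_by_ip >> "$out"
    # Truncate in place instead of deleting, so the file Apache has
    # open stays valid. Assumption: lines written between the grep
    # and this truncation are lost.
    : > "$log"
}
```

It could be called as: harvest_log /var/log/httpd/access_log 'GET /file.txt' somefile.txt. Because the log is truncated rather than removed, restarting Apache is probably unnecessary, though that is worth verifying on your own setup. For the nightly run, a crontab line such as "30 2 * * * /usr/local/bin/harvest_access_log.sh" (a hypothetical script path) would start it at 02:30 each night.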
answered 5 December 2019 at 11:37