Apache Log Counting using Awk and Sed

Since I (and you as a visitor) don't want your IP address to be spread around the internet, I've anonymized the log data. It's a fairly easy process that is done in two steps:

  1. IPs are translated into random values.
  2. Admin URLs are removed.

Step 1: Translating IPs

All the IPs are translated into random IPs, but every IP has its own random counterpart. This means that you can still identify users who are browsing through the site. The actual command I have used for this is:

cat apache-anon-noadmin.log | awk 'function ri(n) {  return int(n*rand()); }  \
BEGIN { srand(); }  { if (! ($1 in randip)) {  \
randip[$1] = sprintf("%d.%d.%d.%d", ri(255), ri(255), ri(255), ri(255)); } \
$1 = randip[$1]; print $0  }'
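As a quick sanity check (my own addition, assuming the raw log is called apache.log and the anonymized output was saved to apache-anon.log), you can compare the number of distinct IPs before and after the translation; the two counts should match unless two real addresses happened to collide on the same random one:

awk '{ print $1 }' apache.log | sort -u | wc -l
awk '{ print $1 }' apache-anon.log | sort -u | wc -l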

If you read a bit further we will find out what this actually does, but you should already be able to follow most of it (at least the general structure).

Step 2: Removing admin URLs

I don't like the idea of everybody being able to see all the admin requests I've made on the site. Luckily this is a very simple process: we only have to remove the requests that start with "/wp-admin", which can be done with an inverted grep:

cat apache-anon.log | grep -v '/wp-admin' > apache-anon-noadmin.log
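If you also want to hide other private paths (the /wp-login pattern below is just an example of mine, adjust it to your own setup), grep happily takes several -e patterns at once:

cat apache-anon.log | grep -v -e '/wp-admin' -e '/wp-login' > apache-anon-noadmin.log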

 

Example 1: count HTTP status codes

For now we want to deal with the status codes, which are found in field $9. The following code will print field 9 for every record in our log:

cat apache-anon-noadmin.log | awk ' { print $9 } '

That's nice, but let's aggregate this data: we want to know how many times each status code occurs. The "uniq" command can count (and display) how often each value appears, but before we can use uniq we have to sort the data, since uniq only collapses adjacent identical lines and starts a new count as soon as it encounters a different value (try the following line with and without the "sort" to see what I mean).

cat apache-anon-noadmin.log | awk ' { print $9 } ' | sort | uniq -c

And the output should be:

72951 200
  235 206
 1400 301
   38 302
 2911 304
 2133 404
 1474 500

As you can see, status 200 (which stands for OK) is returned 72,951 times, while a 404 (page not found) was returned 2,133 times. Cool…
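If you prefer to see the most frequent status codes at the top (a small variation of mine, not part of the original output), add a reverse numeric sort at the end:

cat apache-anon-noadmin.log | awk ' { print $9 } ' | sort | uniq -c | sort -rn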

 

Example 2: top 10 of visiting IPs

Let's try to create some top 10s. The first one is about the IPs that did the most pageviews (my fans, but most probably it would be me :p):

cat apache-anon-noadmin.log | awk '{ print $1 ; }' | \
sort | uniq -c | sort -n -r | head -n 10

We use awk to print the first field – the IP – then sort and count the entries. THEN we sort again, but this time in reverse order and numerically, so 10 is sorted after 9 instead of after 1 (again, remove the second sort to see what I mean). Finally we keep only the first 10 lines with the head command.

As you can see, I use (a lot of) different Unix commands to achieve what I need. It MIGHT be possible to do all of this in awk itself, but by combining it with other commands we get the job done quickly and easily.
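Just to illustrate that point (a sketch of mine, not the author's command), the status-code count from example 1 can be done entirely inside awk with an array, which is exactly the technique example 3 introduces:

awk ' { count[$9]++ } END { for (s in count) print count[s], s } ' apache-anon-noadmin.log | sort -rn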

 

Example 3: traffic in kilobytes per status code

Let's introduce arrays. Field $10 holds the number of bytes we have sent out, and field $9 the status code. In the null pattern (the block without any pattern, which is executed for every line) we add the number of bytes to the array under the $9 index; this does not print any information yet. At the end of the program we iterate over the "total" array and print each status code together with the total number of bytes divided by 1024, so we get kilobytes. Still pretty easy to understand.

cat apache-anon-noadmin.log  | awk ' { total[$9] += $10 } \
END {  for (x in total) { printf "Status code %3d : %9.2f Kb\n", x, total[x]/1024 } } '
Status code 200 : 329836.22 Kb
Status code 206 :   4649.29 Kb
Status code 301 :    535.72 Kb
Status code 302 :     20.26 Kb
Status code 304 :    572.77 Kb
Status code 404 :   5106.29 Kb
Status code 500 :   2336.42 Kb

Not a lot of redirections, but still: 5 megabytes wasted on serving pages that were not found 🙁

Let’s expand this example so we get a total sum:

cat apache-anon-noadmin.log  | awk ' { totalkb += $10; total[$9] += $10 } \
END {  for (x in total) { printf "Status code %3d : %9.2f Kb\n", x, total[x]/1024 } \
printf ("\nTotal sent      : %9.2f Kb\n", totalkb/1024); } '
Status code 200 : 329836.22 Kb
Status code 206 :   4649.29 Kb
Status code 301 :    535.72 Kb
Status code 302 :     20.26 Kb
Status code 304 :    572.77 Kb
Status code 404 :   5106.29 Kb
Status code 500 :   2336.42 Kb

Total sent      : 343056.96 Kb
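Along the same lines (a variation of mine, not from the original post), a second array with a per-status request count gives you the average response size per status code:

cat apache-anon-noadmin.log | awk ' { total[$9] += $10; count[$9]++ } \
END { for (x in total) { printf "Status code %3d : %9.2f Kb avg\n", x, total[x]/count[x]/1024 } } '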

 

Example 4: top 10 referrers

We use the " character as separator here. We need this because the referrer is enclosed in those quotes; this is how we can deal with request URLs, referrers and user agents without problems. This time we don't use a BEGIN block to change the FS variable, but set it through a command line parameter (-F). Most of the referrers are either from our own blog or a '-' when no referrer is given, so we add additional grep commands to remove those. Again, sorting, doing a unique count, reverse numeric sorting and limiting with head gives us a nice result:

cat apache-anon-noadmin.log | awk -F\" ' { print $4 } ' | \
grep -v '-' | grep -v 'http://www.adayinthelife' | sort | \
uniq -c | sort -rn | head -n 10
 343 http://www.phpdeveloper.org/news/15544
 175 http://www.dzone.com/links/rss/top5_certifications_for_every_php_programmer.html
 71 http://www.dzone.com/links/index.html
 64 http://www.google.com/reader/view/
 54 http://www.phpdeveloper.org/
 50 http://phpdeveloper.org/
 49 http://www.dzone.com/links/r/top5_certifications_for_every_php_programmer.html
 45 http://www.phpdeveloper.org/news/15544?utm_source=twitterfeed&utm_medium=twitter
 22 http://abcphp.com/41578/
 21 http://twitter.com

At least I can quickly see which sites I need to send some Christmas cards to.
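If you care more about the referring sites than about the individual URLs (my own variation), split the referrer on '/' and count only the host part:

cat apache-anon-noadmin.log | awk -F\" ' { print $4 } ' | \
grep -v '^-$' | grep -v 'http://www.adayinthelife' | \
awk -F/ ' { print $3 } ' | sort | uniq -c | sort -rn | head -n 10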

 

Example 5: top 10 user-agents

How simple is this? The user agent is in column 6 instead of 4 and we don't need the greps, so this one needs no further explanation:

cat apache-anon-noadmin.log | awk -F\" ' { print $6 } ' | \
sort | uniq -c | sort -rn | head -n 10
 5891 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10
 4145 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
 3440 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10
 2338 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
 2314 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.6) Gecko/2009011912 Firefox/3.0.6
 2001 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
 1959 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10
 1241 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4
 1122 Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10
 1010 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12

There are many different packages that allow you to generate reports on who’s visiting your site and what they’re doing. The most popular at this time appear to be “Analog”, “The Webalizer” and “AWStats” which are installed by default on many shared servers.

While such programs generate attractive reports, they only scratch the surface of what the log files can tell you. In this section we look at ways you can delve more deeply – focussing on the use of simple command line tools, particularly grep, awk and sed.

1. Combined log format

The following assumes an Apache HTTP Server combined log format where each entry in the log file contains the following information:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
where:

%h = IP address of the client (remote host) which made the request
%l = RFC 1413 identity of the client
%u = userid of the person requesting the document
%t = Time that the server finished processing the request
%r = Request line from the client in double quotes
%>s = Status code that the server sends back to the client
%b = Size of the object returned to the client
The final two items: Referer and User-agent give details on where the request originated and what type of agent made the request.

Sample log entries:

66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
Note: The robots.txt file gives instructions to robots as to which parts of your site they are allowed to index. A request for / is a request for the default index page, normally index.html.

2. Using awk

The principal use of awk is to break up each line of a file into ‘fields’ or ‘columns’ using a pre-defined separator. Because each line of the log file is based on the standard format we can do many things quite easily.

Using the default separator which is any white-space (spaces or tabs) we get the following:

awk '{print $1}' combined_log # ip address (%h)
awk '{print $2}' combined_log # RFC 1413 identity (%l)
awk '{print $3}' combined_log # userid (%u)
awk '{print $4,$5}' combined_log # date/time (%t)
awk '{print $9}' combined_log # status code (%>s)
awk '{print $10}' combined_log # size (%b)
You might notice that we've missed out some items. To get to them we need to set the delimiter to the " character, which changes the way the lines are 'exploded' and allows the following:

awk -F\" '{print $2}' combined_log # request line (%r)
awk -F\" '{print $4}' combined_log # referer
awk -F\" '{print $6}' combined_log # user agent
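If you ever need fields from both 'views' of a line at the same time (a sketch of mine, not from the original article), you can keep the " separator and split the first chunk again inside awk:

awk -F\" '{ split($1, pre, " "); print pre[1], $6 }' combined_log # ip address and user agent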
Now that you understand the basics of breaking up the log file and identifying different elements, we can move on to more practical examples.

3. Examples

You want to list all user agents ordered by the number of times they appear (descending order):

awk -F\" '{print $6}' combined_log | sort | uniq -c | sort -fr
All we're doing here is extracting the user agent field from the log file and 'piping' it through some other commands. The first sort is to enable uniq to properly identify and count unique user agents. The final sort orders the result by number and name (both descending).

The result will look similar to a user agents report generated by one of the above-mentioned packages. The difference is that you can generate this ANY time from ANY log file or files.

If you're not particularly interested in which operating system the visitor is using, or what browser extensions they have, then you can use something like the following:

awk -F\" '{print $6}' combined_log \
| sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/' \
| sort | uniq -c | sort -fr
Note: The \ at the end of a line simply indicates that the command will continue on the next line.

This will strip out the third and subsequent values in the ‘bracketed’ component of the user agent string. For example:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR)
becomes:

Mozilla/4.0 (compatible; MSIE 6.0)
The next step is to start filtering the output so you can narrow down on a certain page or referer. Would you like to know which pages Google has been requesting from your site?

awk -F\" '($6 ~ /Googlebot/){print $2}' combined_log | awk '{print $2}'
Or who’s been looking at your guestbook?

awk -F\" '($2 ~ /guestbook\.html/){print $6}' combined_log
It's just too easy, isn't it?

Using just the examples above you can already generate your own reports to back up any kind of automated reporting your ISP provides. You could even write your own log analysis program.
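As a sketch of what such a home-grown report might look like (the script name and wording are my own; the log file is passed as the first argument), a few of the commands above can simply be bundled into a small shell script:

#!/bin/sh
# mini-report.sh -- hypothetical example: ./mini-report.sh combined_log
LOG=${1:-combined_log}
echo "== Requests per status code =="
awk '{print $9}' "$LOG" | sort | uniq -c | sort -rn
echo "== Top 10 user agents =="
awk -F\" '{print $6}' "$LOG" | sort | uniq -c | sort -rn | head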

4. Using log files to identify problems with your site

The steps outlined below will let you identify problems with your site by identifying the different server responses and the requests that caused them:

awk '{print $9}' combined_log | sort | uniq -c | sort
The output shows how many of each type of request your site is getting. A ‘normal’ request results in a 200 code which means a page or file has been requested and delivered but there are many other possibilities.

The most common responses are:

200 – OK
206 – Partial Content
301 – Moved Permanently
302 – Found
304 – Not Modified
401 – Unauthorised (password required)
403 – Forbidden
404 – Not Found
Note: For more on Status Codes you can read the article HTTP Server Status Codes.

A 301 or 302 code means that the request has been re-directed. What you’d like to see, if you’re concerned about bandwidth usage, is a lot of 304 responses – meaning that the file didn’t have to be delivered because they already had a cached version.

A 404 code may indicate that you have a problem – a broken internal link or someone linking to a page that no longer exists. You might need to fix the link, contact the site with the broken link, or set up a PURL so that the link can work again.

The next step is to identify which pages/files are generating the different codes. The following command will summarise the 404 (“Not Found”) requests:

# list all 404 requests
awk '($9 ~ /404/)' combined_log

# summarise 404 requests
awk '($9 ~ /404/)' combined_log | awk '{print $9,$7}' | sort
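If there are a lot of them, it also helps to see which missing URLs are requested most often (my own variation on the command above):

awk '($9 ~ /404/)' combined_log | awk '{print $7}' | sort | uniq -c | sort -rn | head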
Or, you can use an inverted regular expression to summarise the requests that didn’t return 200 (“OK”):

awk '($9 !~ /200/)' combined_log | awk '{print $9,$7}' | sort | uniq
Or, you can include (or exclude in this case) a range of responses, in this case requests that returned 200 (“OK”) or 304 (“Not Modified”):

awk '($9 !~ /200|304/)' combined_log | awk '{print $9,$7}' | sort | uniq
Suppose you've identified a link that's generating a lot of 404 errors. Let's see where the requests are coming from:

awk -F\" '($2 ~ "^GET /path/to/brokenlink\.html"){print $4,$6}' combined_log
Now you can see not just the referer, but the user-agent making the request. You should be able to identify whether there is a broken link within your site, on an external site, or if a search engine or similar agent has an invalid address.

If you can’t fix the link, you should look at using Apache mod_rewrite or a similar scheme to redirect (301) the requests to the most appropriate page on your site. By using a 301 instead of a normal (302) redirect you are indicating to search engines and other intelligent agents that they need to update their link as the content has ‘Moved Permanently’.

5. Who’s ‘hotlinking’ my images?

Something that really annoys some people is other websites linking directly to their images and eating up their bandwidth.

Here’s how you can see who’s doing this to your site. Just change www.example.net to your domain, and combined_log to your combined log file.

awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.example\.net/){print $4}' combined_log \
| sort | uniq -c | sort
Translation:

explode each row using the " character;
the request line (%r) must contain “.jpg” or “.gif”;
the referer must not start with your website address (www.example.net in this example);
display the referer and summarise.
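If your images are served under other extensions as well, PNG or JPEG for instance (an assumption on my part), just widen the pattern:

awk -F\" '($2 ~ /\.(jpg|jpeg|gif|png)/ && $4 !~ /^http:\/\/www\.example\.net/){print $4}' combined_log \
| sort | uniq -c | sort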
You can block hot-linking using mod_rewrite but that can also result in blocking various search engine result pages, caches and online translation software. To see if this is happening, we look for 403 (“Forbidden”) errors in the image requests:

# list image requests that returned 403 Forbidden
awk '($9 ~ /403/)' combined_log \
| awk -F\" '($2 ~ /\.(jpg|gif)/){print $4}' \
| sort | uniq -c | sort
Translation:

the status code (%>s) is 403 Forbidden;
the request line (%r) contains “.jpg” or “.gif”;
display the referer and summarise.
You might notice that the above command is simply a combination of the previous one and one presented earlier. It is necessary to call awk more than once because the 'referer' field is only available after the separator is set to \", whereas the 'status code' is available directly.

6. Blank User Agents

A ‘blank’ user agent is typically an indication that the request is from an automated script or someone who really values their privacy. The following command will give you a list of ip addresses for those user agents so you can decide if any need to be blocked:

awk -F\" '($6 ~ /^-?$/)' combined_log | awk '{print $1}' | sort | uniq
A further pipe through logresolve will give you the hostnames of those addresses.
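For example (a sketch of that extra step; logresolve ships with Apache and replaces the leading IP address on each line it reads with the corresponding hostname):

awk -F\" '($6 ~ /^-?$/)' combined_log | awk '{print $1}' | sort | uniq | logresolve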

 

View Apache requests per day

Run the following command to see requests per day:
awk '{print $4}' rmohan.com | cut -d: -f1 | uniq -c
View Apache requests per hour

Run the following command to see requests per hour:
grep "23/Jan" rmohan.com | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c
View Apache requests per minute

grep "23/Jan/2013:06" rmohan.com | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":"$3}' | sort -nk1 -nk2 | uniq -c | awk '{ if ($1 > 10) print $0}'
1 – Most Common 404s (Page Not Found)
cut -d'"' -f2,3 /var/log/apache/access.log | awk '$4==404{print $4" "$2}' | sort | uniq -c | sort -rg

2 – Count requests by HTTP code

cut -d'"' -f3 /var/log/apache/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg

3 – Largest Images
cut -d'"' -f2,3 /var/log/apache/access.log | grep -E '\.jpg|\.png|\.gif' | awk '{print $5" "$2}' | sort | uniq | sort -rg

4 – Filter Your IPs Requests
tail -f /var/log/apache/access.log | grep <your IP>

5 – Top Referring URLS
cut -d'"' -f4 /var/log/apache/access.log | grep -v '^-$' | grep -v '^http://www.rmohan.com' | sort | uniq -c | sort -rg

6 – Watch Crawlers Live
For this we need an extra file which we’ll call bots.txt. Here’s the contents:

Bot
Crawl
ai_archiver
libwww-perl
spider
Mediapartners-Google
slurp
wget
httrack

This just helps us to filter out common user agents used by crawlers.
Here’s the command:
tail -f /var/log/apache/access.log | grep -f bots.txt

7 – Top Crawlers
This command will show you all the spiders that crawled your site with a count of the number of requests.
cut -d'"' -f6 /var/log/apache/access.log | grep -f bots.txt | sort | uniq -c | sort -rg
How To Get A Top Ten
You can easily turn the commands above that aggregate (the ones using uniq) into a top ten by adding this to the end:
| head

That is, pipe the output to the head command.
Simple as that.

Zipped Log Files
If you want to run the above commands on a logrotated file, you can adjust easily by starting with a zcat on the file then piping to the first command (the one with the filename).

So this:
cut -d'"' -f3 /var/log/apache/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg
Would become this:
zcat /var/log/apache/access.log.1.gz | cut -d'"' -f3 | cut -d' ' -f2 | sort | uniq -c | sort -rg

 

Analyse an Apache access log for the most common IP addresses
tail -10000 access_log | awk '{print $1}' | sort | uniq -c | sort -n | tail
Alternatives
zcat access_log.*.gz | awk '{print $7}' | sort | uniq -c | sort -n | tail -n 20
awk 'NR<=10000{a[$1]++}END{for (i in a) printf "%-6d %s\n",a[i], i|"sort -n"}' access.log

 

 

Print lines within a particular time range

awk '/01:05:/,/01:20:/' access.log
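If the same clock times occur on more than one day in the file (my assumption), anchor the range on the full date as well, using the timestamp format shown in the sample entries earlier:

awk '/18\/Sep\/2004:01:05:/,/18\/Sep\/2004:01:20:/' access.log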
Sort access log by response size (increasing)

awk --re-interval '{ match($0, /(([^[:space:]]+|\[[^\]]+\]|"[^"]+")[[:space:]]+){7}/, m); print m[2], $0 }' access.log | sort -nk 1

View TCP connection status
netstat -nat | awk '{print $6}' | sort | uniq -c | sort -rn
netstat -n | awk '/^tcp/ {++S[$NF]}; END {for (a in S) print a, S[a]}'
netstat -n | awk '/^tcp/ {++state[$NF]}; END {for (key in state) print key, "\t", state[key]}'
netstat -n | awk '/^tcp/ {++arr[$NF]}; END {for (k in arr) print k, "\t", arr[k]}'
netstat -n | awk '/^tcp/ {print $NF}' | sort | uniq -c | sort -rn
netstat -ant | awk '{print $NF}' | grep -v '[a-z]' | sort | uniq -c
netstat -ant | awk '/ip:80/ {split($5, ip, ":"); ++S[ip[1]]} END {for (a in S) print S[a], a}' | sort -n
netstat -ant | awk '/:80/ {split($5, ip, ":"); ++S[ip[1]]} END {for (a in S) print S[a], a}' | sort -rn | head -n 10
awk 'BEGIN {printf("http_code\tcount_num\n")} {COUNT[$10]++} END {for (a in COUNT) printf a "\t\t" COUNT[a] "\n"}'
Find the top 20 IPs by number of requests (commonly used to find the source of an attack):
netstat -anlp | grep 80 | grep tcp | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | head -n20
netstat -ant | awk '/:80/ {split($5, ip, ":"); ++A[ip[1]]} END {for (i in A) print A[i], i}' | sort -rn | head -n20
Use tcpdump to sniff port 80 and see which IPs access the site most:
tcpdump -i eth0 -tnn dst port 80 -c 1000 | awk -F"." '{print $1"."$2"."$3"."$4}' | sort | uniq -c | sort -nr | head -20
4. Find connections stuck in TIME_WAIT:
netstat -n | grep TIME_WAIT | awk '{print $5}' | sort | uniq -c | sort -rn | head -n20
5. Find SYN connections:
netstat -an | grep SYN | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | more
6. Find the process listening on a port:
netstat -ntlp | grep 80 | awk '{print $7}' | cut -d/ -f1
Web logs (Apache):
1. Top 10 IP addresses by number of requests:
cat access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
cat access.log | awk '{counts[$11] += 1}; END {for (url in counts) print counts[url], url}'
2. Most visited files or pages (top 20), plus the number of distinct IPs:
cat access.log | awk '{print $11}' | sort | uniq -c | sort -nr | head -20
awk '{print $1}' access.log | sort -n -r | uniq -c | wc -l
3. List transfers of exe files (commonly used when analysing a download site):
cat access.log | awk '($7 ~ /\.exe/) {print $10 " " $1 " " $4 " " $7}' | sort -nr | head -20
4. List exe files larger than 200000 bytes (about 200 KB) and how often each occurs:
cat access.log | awk '($10 > 200000 && $7 ~ /\.exe/) {print $7}' | sort -n | uniq -c | sort -nr | head -100
5. If the last field of the log records the page transfer time, list the pages that are most time-consuming for the client:
cat access.log | awk '($7 ~ /\.php/) {print $NF " " $1 " " $4 " " $7}' | sort -nr | head -100
6. List the most time-consuming pages (more than 60 seconds) and how often they occur:
cat access.log | awk '($NF > 60 && $7 ~ /\.php/) {print $7}' | sort -n | uniq -c | sort -nr | head -100
7. List files whose transmission took more than 30 seconds:
cat access.log | awk '($NF > 30) {print $7}' | sort -n | uniq -c | sort -nr | head -20
8. Website traffic statistics (GB):
cat access.log | awk '{sum += $10} END {print sum/1024/1024/1024}'
9. Count 404 responses:
awk '($9 ~ /404/)' access.log | awk '{print $9, $7}' | sort
10. HTTP status statistics:
cat access.log | awk '{counts[$9] += 1}; END {for (code in counts) print code, counts[code]}'
cat access.log | awk '{print $9}' | sort | uniq -c | sort -rn
11. Requests per second:
awk '{if ($9 ~ /200|30|404/) COUNT[$4]++} END {for (a in COUNT) print a, COUNT[a]}' access.log | sort -k 2 -nr | head -n10
12. Bandwidth statistics:
cat apache.log | awk '{if ($7 ~ /GET/) count++} END {print "client_request=" count}'
cat apache.log | awk '{BYTE += $11} END {print "client_kbyte_out=" BYTE/1024 "KB"}'
13. Number of objects and average object size:
cat access.log | awk '{byte += $10} END {print byte/NR/1024, NR}'
cat access.log | awk '{if ($9 ~ /200|30/) COUNT[$NF]++} END {for (a in COUNT) print a, COUNT[a], NR, COUNT[a]/NR*100 "%"}'
14. Extract a 5-minute slice of the log:
if [ $DATE_MINUTE != $DATE_END_MINUTE ]; then
    # if the start and end timestamps differ, look up the line number of each
    START_LINE=`sed -n "/$DATE_MINUTE/=" $APACHE_LOG | head -n1`
    # END_LINE=`sed -n "/$DATE_END_MINUTE/=" $APACHE_LOG | tail -n1`
    END_LINE=`sed -n "/$DATE_END_MINUTE/=" $APACHE_LOG | head -n1`
    # using those line numbers, copy the 5 minutes of log into a temporary file
    sed -n "${START_LINE},${END_LINE}p" $APACHE_LOG > $MINUTE_LOG
    # extract the start timestamp from its line number
    GET_START_TIME=`sed -n "${START_LINE}p" $APACHE_LOG | awk -F'[' '{print $2}' | awk '{print $1}' | sed 's#/##g' | sed 's#:##'`
    # extract the end timestamp from its line number
    GET_END_TIME=`sed -n "${END_LINE}p" $APACHE_LOG | awk -F'[' '{print $2}' | awk '{print $1}' | sed 's#/##g' | sed 's#:##'`
fi
15. Spider analysis
See which spiders are crawling your content:
/usr/sbin/tcpdump -i eth0 -l -s 0 -w - dst port 80 | strings | grep -i user-agent | grep -i -E 'bot|crawler|slurp|spider'
Site analysis 2 (Squid)
Traffic statistics per domain:
zcat squid_access.log.tar.gz | awk '{print $10, $7}' | awk 'BEGIN {FS="[/]"} {trfc[$4] += $1} END {for (domain in trfc) {printf "%s\t%d\n", domain, trfc[domain]}}'
Database
1. Watch SQL queries on the wire:
/usr/sbin/tcpdump -i eth0 -s 0 -l -w - dst port 3306 | strings | egrep -i 'SELECT|UPDATE|DELETE|INSERT|SET|COMMIT|ROLLBACK|CREATE|DROP|ALTER|CALL'
System debugging
1. Debugging commands:
strace -p PID    # trace the specified process by PID
gdb -p PID       # attach a debugger to the specified process

CHECKING FOR HIGH VISITS FROM A LIMITED NUMBER OF IPS

First locate the log file for your site. The generic log is generally at /var/log/httpd/access_log or /var/log/apache2/access_log (depending on your distro). For virtualhost-specific logs, check the conf files or (if you have one active site and others in the background) run ls -alt /var/log/httpd to see which file is most recently updated.

1. Check out total unique visitors:-
cat access.log | awk '{print $1}' | sort | uniq -c | wc -l

2. Check out unique visitors today:-
cat access.log | grep `date '+%e/%b/%G'` | awk '{print $1}' | sort | uniq -c | wc -l

3. Check out unique visitors this month:-
cat access.log | grep `date '+%b/%G'` | awk '{print $1}' | sort | uniq -c | wc -l

4. Check out unique visitors on an arbitrary date:-
cat access.log | grep 22/Mar/2013 | awk '{print $1}' | sort | uniq -c | wc -l

5. Check out unique visitors for the month of March:-
cat access.log | grep Mar/2013 | awk '{print $1}' | sort | uniq -c | wc -l

6. Check out the number of visits/requests per visitor IP:-
cat access.log | awk '{print "requests from " $1}' | sort | uniq -c | sort

7. Check out the number of visits/requests per IP for a given date:-
cat access.log | grep 26/Mar/2013 | awk '{print "requests from " $1}' | sort | uniq -c | sort

8. Find out the top visiting IPs in the last 5,000 hits:

tail -5000 access.log | awk '{print $1}' | sort | uniq -c | sort -n

9. Finally, if you have a ton of domains you may want to use this to aggregate them:

for k in `ls --color=none`; do echo "Top visitors by ip for: $k"; awk '{print $1}' ~/logs/$k/http/access.log | sort | uniq -c | sort -n | tail; done

10. This command is great if you want to see what is being called the most (that can often show you that a specific script is being abused if it’s being called way more times than anything else in the site):

awk '{print $7}' access.log | cut -d? -f1 | sort | uniq -c | sort -nk1 | tail -n10

11. If you have multiple domains on a PS (PS only!) run this command to get all traffic for all domains on the PS:

for k in `ls -S /home/*/logs/*/http/access.log`; do wc -l $k | sort -r -n; done

12. Here is an alternative to the above command which does the same thing, this is for VPS only using an admin user:
sudo find /home/*/logs -type f -name "access.log" -exec wc -l "{}" \; | sort -r -n

13. If you're on a shared server you can run this command, which will do the same as the one above but just for the domains in your logs directory. You have to run this command while you're in your user's logs directory:
for k in `ls -S */http/access.log`; do wc -l $k | sort -r -n; done

14. Grep the Apache access.log and list IPs by hits for a given date:-
grep Mar/2013 /var/log/apache2/access.log | awk '{ print $1 }' | sort -n | uniq -c | sort -rn | head

15. Find out top referring URLs:

cut -d'"' -f4 /var/log/apache/access.log | grep -v '^-$' | grep -v '^http://www.your-site.com' | sort | uniq -c | sort -rg

16. Check out the top 'Page Not Found' (404) requests:
cut -d'"' -f2,3 /var/log/apache/access.log | awk '$4==404{print $4" "$2}' | sort | uniq -c | sort -rg

17. Check out top largest images:-
cut -d'"' -f2,3 /var/log/apache/access.log | grep -E '\.jpg|\.png|\.gif' | awk '{print $5" "$2}' | sort | uniq | sort -rg
18. Check out server response code:-
cut -d'”‘ -f3 /var/log/apache/access.log | cut -d’ ‘ -f2 | sort | uniq -c | sort -rg
19. Check out Apache requests per day.
awk '{print $4}' access.log | cut -d: -f1 | uniq -c
20. Check out Apache requests per hour.
grep "6/May" access.log | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c
21. Check out Apache requests per minute.
grep "6/May/2013:06" access.log | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":"$3}' | sort -nk1 -nk2 | uniq -c | awk '{ if ($1 > 10) print $0}'
22. All those commands can be easily run on a log-rotated file:
zcat /var/log/apache/access.log.1.gz | cut -d'"' -f3 | cut -d' ' -f2 | sort | uniq -c | sort -rg

Example:
A collection of grep-based log analysis one-liners
1. The top 20 URLs visited on 2012-05-04, sorted:
cat access.log | grep '04/May/2012' | awk '{print $11}' | sort | uniq -c | sort -nr | head -20
Query the IP addresses that requested URLs containing www.abc.com:
cat access_log | awk '($11 ~ /www.abc.com/) {print $1}' | sort | uniq -c | sort -nr
2. The top 10 IP addresses by number of requests (this can also be restricted to a time period):
cat linewow-access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
Query the log for a specific time window:
cat wangsu.log | egrep '06/Sep/2012:14:35|06/Sep/2012:15:05' | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
The 10 most active IPs on a given day:
cat /tmp/access.log | grep "20/Mar/2011" | awk '{print $3}' | sort | uniq -c | sort -nr | head
What the IP with the most connections that day was requesting:
cat access.log | grep "10.0.21.17" | awk '{print $8}' | sort | uniq -c | sort -nr | head -n 10
Find the busiest minutes:
awk '{print $1}' access.log | grep "20/Mar/2011" | cut -c 14-18 | sort | uniq -c | sort -nr | head
