{"id":2786,"date":"2014-02-10T23:26:26","date_gmt":"2014-02-10T15:26:26","guid":{"rendered":"http:\/\/rmohan.com\/?p=2786"},"modified":"2014-07-23T23:34:22","modified_gmt":"2014-07-23T15:34:22","slug":"apache-log-counting-using-awk-and-seed","status":"publish","type":"post","link":"https:\/\/mohan.sg\/?p=2786","title":{"rendered":"Apache Log Counting using Awk and Sed"},"content":{"rendered":"<h2><span style=\"line-height: 1.5em; font-size: 14px;\">Since I (and you as a visitor) don\u2019t want your IP-address to be spread around the internet, I\u2019ve anonymized the log data. It\u2019s a fairly easy process that is done in 2 steps:<\/span><\/h2>\n<ol>\n<li>IP\u2019s are translated into random values.<\/li>\n<li>Admin url\u2019s are removed.<\/li>\n<\/ol>\n<h3>Step 1: Translating IP\u2019s<\/h3>\n<p>All the IP\u2019s are translated into random IP\u2019s, but every IP has it\u2019s own random counterpart. This means that you can still identify users who are browsing through the site. The actual command I have used for this is:<\/p>\n<pre>cat apache-anon-noadmin.log | awk 'function ri(n) {  return int(n*rand()); }\u00a0 \\\r\nBEGIN { srand(); }\u00a0 { if (! ($1 in randip)) {  \\\r\nrandip[$1] = sprintf(\"%d.%d.%d.%d\", ri(255), ri(255), ri(255), ri(255)); } \\\r\n$1 = randip[$1]; print $0\u00a0 }'<\/pre>\n<p>If you read a bit further we will find out what this will actually do, but most of it you should be able to understand (a least the global format).<\/p>\n<h3>Step 2: Removing admin url\u2019s<\/h3>\n<p>I don\u2019t like that everybody can view all the admin-requests I\u2019ve done on the site. Luckely this is a very simple process. We only have to remove the requests that start with \u201c\/wp-admin\u201d. This can be done by an inverse grep-command:<\/p>\n<pre>cat apache-anon.log | grep -v '\/wp-admin' &gt; apache-anon-noadmin.log<\/pre>\n<p>&nbsp;<\/p>\n<h2>Example 1: count http status codes<\/h2>\n<p>For now we want to deal with the status-codes. 
This is found in field $9. The following code will print field 9 for every record from our log:<\/p>\n<pre>cat apache-anon-noadmin.log | awk ' { print $9 } '<\/pre>\n<p>That\u2019s nice, but let\u2019s aggregate this data. We want to know how many times we returned each status code. By using the \u201cuniq\u201d command, we can count (and display) the number of times we encounter data, but before we can use uniq we have to sort the data, since uniq stops counting as soon as a different piece of data is encountered (try the following line with and without the \u201csort\u201d to see what I mean).<\/p>\n<pre>cat apache-anon-noadmin.log | awk ' { print $9 } ' | sort | uniq -c<\/pre>\n<p>And the output should be:<\/p>\n<pre>72951 200\r\n  235 206\r\n 1400 301\r\n   38 302\r\n 2911 304\r\n 2133 404\r\n 1474 500<\/pre>\n<p>As you can see, the 200 (which stands for OK) is returned 72951 times, while a 404 (page not found) was returned 2133 times. Cool\u2026<\/p>\n<p>&nbsp;<\/p>\n<h2>Example 2: top 10 of visiting ip\u2019s<\/h2>\n<p>Let\u2019s try to create some top-10\u2019s. The first one is about the IP\u2019s that did the most pageviews (my fans, but most probably it would be me :p).<\/p>\n<pre>cat apache-anon-noadmin.log | awk '{ print $1 ; }' | \\\r\nsort | uniq -c | sort -n -r | head -n 10<\/pre>\n<p>We use awk to print the first field \u2013 the IP \u2013 and we sort and count the result. THEN we sort again, but this time in reversed order and with a natural sort, so 10 will be sorted after 9 instead of after 1 (again, remove the sort to find out what I mean). After this, we keep only the first 10 lines with the head command.<\/p>\n<p>As you can see, I use (a lot of) different unix commands to achieve what I need to do. 
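As a side note, the counting step of such a pipeline can also live inside awk itself, leaving only the ranking to sort and head. A minimal sketch, using a few fabricated (already anonymized) log lines and a file name chosen just for this example:

```shell
# Fabricated sample entries in the combined log format.
cat > sample.log <<'EOF'
10.0.0.1 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "agent"
10.0.0.1 - - [18/Sep/2004:11:07:49 +1000] "GET /a HTTP/1.0" 200 100 "-" "agent"
10.0.0.2 - - [18/Sep/2004:11:07:50 +1000] "GET /b HTTP/1.0" 404 300 "-" "agent"
EOF

# Count hits per IP in an awk array; sort and head only rank and limit.
awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' sample.log \
  | sort -rn | head -n 10
# → 2 10.0.0.1
#   1 10.0.0.2
```

On the real apache-anon-noadmin.log you would simply put that file name in place of sample.log.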
It MIGHT be possible to do this all with awk itself as well, but by using other commands we get the job done quickly and easily.<\/p>\n<p>&nbsp;<\/p>\n<h2>Example 3: traffic in kilobytes per status code<\/h2>\n<p>Let\u2019s introduce arrays. Field $10 holds the number of bytes we have sent out, and field $9 the status code. In the null pattern (the block without any pattern, which is executed on every line) we add the number of bytes to the array under the $9 index. This does NOT print any information yet. At the end of the program, we iterate over the \u201ctotal\u201d-array and print each status code, and the total sum of bytes \/ 1024, so we get kilobytes. Still pretty easy to understand.<\/p>\n<pre>cat apache-anon-noadmin.log\u00a0 | awk ' { total[$9] += $10 } \\\r\nEND {\u00a0 for (x in total) { printf \"Status code %3d : %9.2f Kb\\n\", x, total[x]\/1024 } } '\r\nStatus code 200 : 329836.22 Kb\r\nStatus code 206 :\u00a0  4649.29 Kb\r\nStatus code 301 :\u00a0\u00a0  535.72 Kb\r\nStatus code 302 :\u00a0\u00a0 \u00a0 20.26 Kb\r\nStatus code 304 :\u00a0\u00a0  572.77 Kb\r\nStatus code 404 :\u00a0  5106.29 Kb\r\nStatus code 500 :\u00a0  2336.42 Kb<\/pre>\n<p>Not a lot of redirections, but still: 5 megabytes wasted by serving pages that are not found \ud83d\ude41<\/p>\n<p>Let\u2019s expand this example so we get a total sum:<\/p>\n<pre>cat apache-anon-noadmin.log\u00a0 | awk ' { totalkb += $10; total[$9] += $10 } \\\r\nEND {\u00a0 for (x in total) { printf \"Status code %3d : %9.2f Kb\\n\", x, total[x]\/1024 } \\\r\nprintf (\"\\nTotal sent\u00a0\u00a0\u00a0\u00a0\u00a0 : %9.2f Kb\\n\", totalkb\/1024); } '\r\nStatus code 200 : 329836.22 Kb\r\nStatus code 206 :\u00a0\u00a0 4649.29 Kb\r\nStatus code 301 :\u00a0\u00a0\u00a0 535.72 Kb\r\nStatus code 302 :\u00a0\u00a0\u00a0\u00a0 20.26 Kb\r\nStatus code 304 :\u00a0\u00a0\u00a0 572.77 Kb\r\nStatus code 404 :\u00a0\u00a0 5106.29 Kb\r\nStatus code 500 :\u00a0\u00a0 2336.42 Kb\r\n\r\nTotal sent\u00a0\u00a0\u00a0\u00a0\u00a0 
: 343056.96 Kb<\/pre>\n<p>&nbsp;<\/p>\n<h2>Example 4: top 10 referrers<\/h2>\n<p>We use the \" character as separator here. We need this because the referrer is inside those quotes. This is how we can deal with request-url\u2019s, referrers and user-agents without problems. This time we don\u2019t use a BEGIN block to change the FS-variable, but we change it through a command line parameter. Now, most of the referrers are either from our own blog, or a \u2018-\u2019 when no referrer is given. We add additional grep commands to remove those referrers. Again, sorting, doing a unique count, reverse natural sorting and limiting with head gives us a nice result:<\/p>\n<pre>cat apache-anon-noadmin.log | awk -F\\\" ' { print $4 } ' | \\\r\ngrep -v '-' | grep -v 'http:\/\/www.adayinthelife' | sort | \\\r\nuniq -c | sort -rn | head -n 10\r\n 343 http:\/\/www.phpdeveloper.org\/news\/15544\r\n 175 http:\/\/www.dzone.com\/links\/rss\/top5_certifications_for_every_php_programmer.html\r\n 71 http:\/\/www.dzone.com\/links\/index.html\r\n 64 http:\/\/www.google.com\/reader\/view\/\r\n 54 http:\/\/www.phpdeveloper.org\/\r\n 50 http:\/\/phpdeveloper.org\/\r\n 49 http:\/\/www.dzone.com\/links\/r\/top5_certifications_for_every_php_programmer.html\r\n 45 http:\/\/www.phpdeveloper.org\/news\/15544?utm_source=twitterfeed&amp;utm_medium=twitter\r\n 22 http:\/\/abcphp.com\/41578\/\r\n 21 http:\/\/twitter.com<\/pre>\n<p>At least I can quickly see which sites I need to send some Christmas cards to.<\/p>\n<p>&nbsp;<\/p>\n<h2>Example 5: top 10 user-agents<\/h2>\n<p>How simple is this? 
The user-agent is in column 6 instead of 4 and we don\u2019t need the grep\u2019s, so this one needs no explanation:<\/p>\n<pre>cat apache-anon-noadmin.log | awk -F\\\" ' { print $6 } ' | \\\r\nsort | uniq -c | sort -rn | head -n 10\r\n 5891 Mozilla\/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit\/534.10 (KHTML, like Gecko) Chrome\/8.0.552.215 Safari\/534.10\r\n 4145 Mozilla\/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.12) Gecko\/20101026 Firefox\/3.6.12\r\n 3440 Mozilla\/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit\/534.10 (KHTML, like Gecko) Chrome\/8.0.552.215 Safari\/534.10\r\n 2338 Mozilla\/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.12) Gecko\/20101026 Firefox\/3.6.12\r\n 2314 Mozilla\/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.6) Gecko\/2009011912 Firefox\/3.0.6\r\n 2001 Mozilla\/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko\/20101026 Firefox\/3.6.12\r\n 1959 Mozilla\/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-US) AppleWebKit\/534.10 (KHTML, like Gecko) Chrome\/8.0.552.215 Safari\/534.10\r\n 1241 Mozilla\/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit\/533.19.4 (KHTML, like Gecko) Version\/5.0.3 Safari\/533.19.4\r\n 1122 Mozilla\/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit\/534.10 (KHTML, like Gecko) Chrome\/8.0.552.215 Safari\/534.10\r\n 1010 Mozilla\/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko\/20101026 Firefox\/3.6.12<\/pre>\n<p>There are many different packages that allow you to generate reports on who&#8217;s visiting your site and what they&#8217;re doing. The most popular at this time appear to be &#8220;Analog&#8221;, &#8220;The Webalizer&#8221; and &#8220;AWStats&#8221; which are installed by default on many shared servers.<\/p>\n<p>While such programs generate attractive reports, they only scratch the surface of what the log files can tell you. 
In this section we look at ways you can delve more deeply &#8211; focussing on the use of simple command line tools, particularly grep, awk and sed.<\/p>\n<p>1. Combined log format<\/p>\n<p>The following assumes an Apache HTTP Server combined log format where each entry in the log file contains the following information:<\/p>\n<p>%h %l %u %t &#8220;%r&#8221; %&gt;s %b &#8220;%{Referer}i&#8221; &#8220;%{User-agent}i&#8221;<br \/>\nwhere:<\/p>\n<p>%h = IP address of the client (remote host) which made the request<br \/>\n%l = RFC 1413 identity of the client<br \/>\n%u = userid of the person requesting the document<br \/>\n%t = Time that the server finished processing the request<br \/>\n%r = Request line from the client in double quotes<br \/>\n%&gt;s = Status code that the server sends back to the client<br \/>\n%b = Size of the object returned to the client<br \/>\nThe final two items: Referer and User-agent give details on where the request originated and what type of agent made the request.<\/p>\n<p>Sample log entries:<\/p>\n<p>66.249.64.13 &#8211; &#8211; [18\/Sep\/2004:11:07:48 +1000] &#8220;GET \/robots.txt HTTP\/1.0&#8221; 200 468 &#8220;-&#8221; &#8220;Googlebot\/2.1&#8221;<br \/>\n66.249.64.13 &#8211; &#8211; [18\/Sep\/2004:11:07:48 +1000] &#8220;GET \/ HTTP\/1.0&#8221; 200 6433 &#8220;-&#8221; &#8220;Googlebot\/2.1&#8243;<br \/>\nNote: The robots.txt file gives instructions to robots as to which parts of your site they are allowed to index. A request for \/ is a request for the default index page, normally index.html.<\/p>\n<p>2. Using awk<\/p>\n<p>The principal use of awk is to break up each line of a file into &#8216;fields&#8217; or &#8216;columns&#8217; using a pre-defined separator. 
Because each line of the log file is based on the standard format we can do many things quite easily.<\/p>\n<p>Using the default separator, which is any white-space (spaces or tabs), we get the following:<\/p>\n<p>awk '{print $1}' combined_log # ip address (%h)<br \/>\nawk '{print $2}' combined_log # RFC 1413 identity (%l)<br \/>\nawk '{print $3}' combined_log # userid (%u)<br \/>\nawk '{print $4,$5}' combined_log # date\/time (%t)<br \/>\nawk '{print $9}' combined_log # status code (%&gt;s)<br \/>\nawk '{print $10}' combined_log # size (%b)<br \/>\nYou might notice that we&#8217;ve missed out some items. To get to them we need to set the delimiter to the \" character, which changes the way the lines are &#8216;exploded&#8217; and allows the following:<\/p>\n<p>awk -F\\\" '{print $2}' combined_log # request line (%r)<br \/>\nawk -F\\\" '{print $4}' combined_log # referer<br \/>\nawk -F\\\" '{print $6}' combined_log # user agent<br \/>\nNow that you understand the basics of breaking up the log file and identifying different elements, we can move on to more practical examples.<\/p>\n<p>3. Examples<\/p>\n<p>You want to list all user agents ordered by the number of times they appear (descending order):<\/p>\n<p>awk -F\\\" '{print $6}' combined_log | sort | uniq -c | sort -fr<br \/>\nAll we&#8217;re doing here is extracting the user agent field from the log file and &#8216;piping&#8217; it through some other commands. The first sort is to enable uniq to properly identify and count unique user agents. The final sort orders the result by number and name (both descending).<\/p>\n<p>The result will look similar to a user agents report generated by one of the above-mentioned packages. 
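As a quick check of the field numbering described above, here is a single fabricated log line split both ways (the line itself is made up for the example):

```shell
line='10.0.0.1 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "http://example.org/" "Googlebot/2.1"'

# Default whitespace split: the status code is the 9th field.
echo "$line" | awk '{ print $9 }'        # → 200

# Split on the quote character instead: the request line, referer and
# user agent become fields 2, 4 and 6.
echo "$line" | awk -F\" '{ print $2 }'   # → GET /robots.txt HTTP/1.0
echo "$line" | awk -F\" '{ print $6 }'   # → Googlebot/2.1
```

Everything between the first pair of quotes lands in field 2, between the second pair in field 4, and so on, which is why the quote separator gives clean access to the request, referer and user agent.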
The difference is that you can generate this ANY time from ANY log file or files.<\/p>\n<p>If you&#8217;re not particularly interested in which operating system the visitor is using, or what browser extensions they have, then you can use something like the following:<\/p>\n<p>awk -F\\\" '{print $6}' combined_log \\<br \/>\n| sed 's\/(\\([^;]\\+; [^;]\\+\\)[^)]*)\/(\\1)\/' \\<br \/>\n| sort | uniq -c | sort -fr<br \/>\nNote: The \\ at the end of a line simply indicates that the command will continue on the next line.<\/p>\n<p>This will strip out the third and subsequent values in the &#8216;bracketed&#8217; component of the user agent string. For example:<\/p>\n<p>Mozilla\/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR)<br \/>\nbecomes:<\/p>\n<p>Mozilla\/4.0 (compatible; MSIE 6.0)<br \/>\nThe next step is to start filtering the output so you can narrow down on a certain page or referer. Would you like to know which pages Google has been requesting from your site?<\/p>\n<p>awk -F\\\" '($6 ~ \/Googlebot\/){print $2}' combined_log | awk '{print $2}'<br \/>\nOr who&#8217;s been looking at your guestbook?<\/p>\n<p>awk -F\\\" '($2 ~ \/guestbook\\.html\/){print $6}' combined_log<br \/>\nIt&#8217;s just too easy, isn&#8217;t it?<\/p>\n<p>Using just the examples above you can already generate your own reports to back up any kind of automated reporting your ISP provides. You could even write your own log analysis program.<\/p>\n<p>4. Using log files to identify problems with your site<\/p>\n<p>The steps outlined below will let you identify problems with your site by identifying the different server responses and the requests that caused them:<\/p>\n<p>awk '{print $9}' combined_log | sort | uniq -c | sort<br \/>\nThe output shows how many of each type of request your site is getting. 
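On a handful of fabricated entries the summary looks like this (the log lines and file name are made up for the example):

```shell
# Three fabricated combined-log entries: two 200s and one 404.
printf '%s\n' \
  '10.0.0.1 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "agent"' \
  '10.0.0.2 - - [18/Sep/2004:11:07:49 +1000] "GET /a HTTP/1.0" 200 100 "-" "agent"' \
  '10.0.0.3 - - [18/Sep/2004:11:07:50 +1000] "GET /b HTTP/1.0" 404 300 "-" "agent"' > sample.log

# Extract the status code field, then count each distinct value:
# prints one line per status code with its count (1 404, 2 200).
awk '{ print $9 }' sample.log | sort | uniq -c | sort
```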
A &#8216;normal&#8217; request results in a 200 code which means a page or file has been requested and delivered but there are many other possibilities.<\/p>\n<p>The most common responses are:<\/p>\n<p>200 &#8211; OK<br \/>\n206 &#8211; Partial Content<br \/>\n301 &#8211; Moved Permanently<br \/>\n302 &#8211; Found<br \/>\n304 &#8211; Not Modified<br \/>\n401 &#8211; Unauthorised (password required)<br \/>\n403 &#8211; Forbidden<br \/>\n404 &#8211; Not Found<br \/>\nNote: For more on Status Codes you can read the article HTTP Server Status Codes.<\/p>\n<p>A 301 or 302 code means that the request has been re-directed. What you&#8217;d like to see, if you&#8217;re concerned about bandwidth usage, is a lot of 304 responses &#8211; meaning that the file didn&#8217;t have to be delivered because they already had a cached version.<\/p>\n<p>A 404 code may indicate that you have a problem &#8211; a broken internal link or someone linking to a page that no longer exists. You might need to fix the link, contact the site with the broken link, or set up a PURL so that the link can work again.<\/p>\n<p>The next step is to identify which pages\/files are generating the different codes. 
The following command will summarise the 404 (&#8220;Not Found&#8221;) requests:<\/p>\n<p># list all 404 requests<br \/>\nawk '($9 ~ \/404\/)' combined_log<\/p>\n<p># summarise 404 requests<br \/>\nawk '($9 ~ \/404\/)' combined_log | awk '{print $9,$7}' | sort<br \/>\nOr, you can use an inverted regular expression to summarise the requests that didn&#8217;t return 200 (&#8220;OK&#8221;):<\/p>\n<p>awk '($9 !~ \/200\/)' combined_log | awk '{print $9,$7}' | sort | uniq<br \/>\nOr, you can include (or exclude in this case) a range of responses, in this case requests that returned 200 (&#8220;OK&#8221;) or 304 (&#8220;Not Modified&#8221;):<\/p>\n<p>awk '($9 !~ \/200|304\/)' combined_log | awk '{print $9,$7}' | sort | uniq<br \/>\nSuppose you&#8217;ve identified a link that&#8217;s generating a lot of 404 errors. Let&#8217;s see where the requests are coming from:<\/p>\n<p>awk -F\\\" '($2 ~ \"^GET \/path\/to\/brokenlink\\.html\"){print $4,$6}' combined_log<br \/>\nNow you can see not just the referer, but the user-agent making the request. You should be able to identify whether there is a broken link within your site, on an external site, or if a search engine or similar agent has an invalid address.<\/p>\n<p>If you can&#8217;t fix the link, you should look at using Apache mod_rewrite or a similar scheme to redirect (301) the requests to the most appropriate page on your site. By using a 301 instead of a normal (302) redirect you are indicating to search engines and other intelligent agents that they need to update their link, as the content has &#8216;Moved Permanently&#8217;.<\/p>\n<p>5. Who&#8217;s &#8216;hotlinking&#8217; my images?<\/p>\n<p>Something that really annoys some people is when their bandwidth is being used by their images being linked directly on other websites.<\/p>\n<p>Here&#8217;s how you can see who&#8217;s doing this to your site. 
Just change www.example.net to your domain, and combined_log to your combined log file.<\/p>\n<p>awk -F\\\" '($2 ~ \/\\.(jpg|gif)\/ &amp;&amp; $4 !~ \/^http:\\\/\\\/www\\.example\\.net\/){print $4}' combined_log \\<br \/>\n| sort | uniq -c | sort<br \/>\nTranslation:<\/p>\n<p>explode each row using the \" separator;<br \/>\nthe request line (%r) must contain &#8220;.jpg&#8221; or &#8220;.gif&#8221;;<br \/>\nthe referer must not start with your website address (www.example.net in this example);<br \/>\ndisplay the referer and summarise.<br \/>\nYou can block hot-linking using mod_rewrite, but that can also result in blocking various search engine result pages, caches and online translation software. To see if this is happening, we look for 403 (&#8220;Forbidden&#8221;) errors in the image requests:<\/p>\n<p># list image requests that returned 403 Forbidden<br \/>\nawk '($9 ~ \/403\/)' combined_log \\<br \/>\n| awk -F\\\" '($2 ~ \/\\.(jpg|gif)\/){print $4}' \\<br \/>\n| sort | uniq -c | sort<br \/>\nTranslation:<\/p>\n<p>the status code (%&gt;s) is 403 Forbidden;<br \/>\nthe request line (%r) contains &#8220;.jpg&#8221; or &#8220;.gif&#8221;;<br \/>\ndisplay the referer and summarise.<br \/>\nYou might notice that the above command is simply a combination of the previous one and one presented earlier. It is necessary to call awk more than once because the &#8216;referer&#8217; field is only available after the separator is set to \\\", whereas the &#8216;status code&#8217; is available directly.<\/p>\n<p>6. Blank User Agents<\/p>\n<p>A &#8216;blank&#8217; user agent is typically an indication that the request is from an automated script or someone who really values their privacy. 
The following command will give you a list of ip addresses for those user agents, so you can decide if any need to be blocked:<\/p>\n<p>awk -F\\\" '($6 ~ \/^-?$\/)' combined_log | awk '{print $1}' | sort | uniq<br \/>\nA further pipe through logresolve will give you the hostnames of those addresses.<\/p>\n<p>&nbsp;<\/p>\n<p>View Apache requests per day<\/p>\n<p>Run the following command to see requests per day:<br \/>\nawk '{print $4}' rmohan.com | cut -d: -f1 | uniq -c<br \/>\nView Apache requests per hour<\/p>\n<p>Run the following command to see requests per hour:<br \/>\ngrep \"23\/Jan\" rmohan.com | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2\":00\"}' | sort -n | uniq -c<br \/>\nView Apache requests per minute<\/p>\n<p>grep \"23\/Jan\/2013:06\" rmohan.com | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2\":\"$3}' | sort -nk1 -nk2 | uniq -c | awk '{ if ($1 &gt; 10) print $0}'<br \/>\n1 &#8211; Most Common 404s (Page Not Found)<br \/>\ncut -d'\"' -f2,3 \/var\/log\/apache\/access.log | awk '$4==404{print $4\" \"$2}' | sort | uniq -c | sort -rg<\/p>\n<p>2 &#8211; Count requests by HTTP code<\/p>\n<p>cut -d'\"' -f3 \/var\/log\/apache\/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg<\/p>\n<p>3 &#8211; Largest Images<br \/>\ncut -d'\"' -f2,3 \/var\/log\/apache\/access.log | grep -E '\\.jpg|\\.png|\\.gif' | awk '{print $5\" \"$2}' | sort | uniq | sort -rg<\/p>\n<p>4 &#8211; Filter Your IPs Requests<br \/>\ntail -f \/var\/log\/apache\/access.log | grep &lt;your IP&gt;<\/p>\n<p>5 &#8211; Top Referring URLS<br \/>\ncut -d'\"' -f4 \/var\/log\/apache\/access.log | grep -v '^-$' | grep -v '^http:\/\/www.rmohan.com' | sort | uniq -c | sort -rg<\/p>\n<p>6 &#8211; Watch Crawlers Live<br \/>\nFor this we need an extra file which 
we&#8217;ll call bots.txt. Here&#8217;s the contents:<\/p>\n<p>Bot<br \/>\nCrawl<br \/>\nai_archiver<br \/>\nlibwww-perl<br \/>\nspider<br \/>\nMediapartners-Google<br \/>\nslurp<br \/>\nwget<br \/>\nhttrack<\/p>\n<p>This just helps us to filter out common user agents used by crawlers.<br \/>\nHere&#8217;s the command:<br \/>\ntail -f \/var\/log\/apache\/access.log | grep -f bots.txt<\/p>\n<p>7 &#8211; Top Crawlers<br \/>\nThis command will show you all the spiders that crawled your site with a count of the number of requests.<br \/>\ncut -d'\"' -f6 \/var\/log\/apache\/access.log | grep -f bots.txt | sort | uniq -c | sort -rg<br \/>\nHow To Get A Top Ten<br \/>\nYou can easily turn the commands above that aggregate (the ones using uniq) into a top ten by adding this to the end:<br \/>\n| head<\/p>\n<p>That is, pipe the output to the head command.<br \/>\nSimple as that.<\/p>\n<p>Zipped Log Files<br \/>\nIf you want to run the above commands on a logrotated file, you can adjust easily by starting with a zcat on the file and then piping to the first command (the one with the filename).<\/p>\n<p>So this:<br \/>\ncut -d'\"' -f3 \/var\/log\/apache\/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg<br \/>\nWould become this:<br \/>\nzcat \/var\/log\/apache\/access.log.1.gz | cut -d'\"' -f3 | cut -d' ' -f2 | sort | uniq -c | sort -rg<\/p>\n<p>&nbsp;<\/p>\n<p>Analyse an Apache access log for the most common IP addresses<br \/>\ntail -10000 access_log | awk '{print $1}' | sort | uniq -c | sort -n | tail<br \/>\nAlternatives:<br \/>\nzcat access_log.*.gz | awk '{print $7}' | sort | uniq -c | sort -n | tail -n 20<br \/>\nawk 'NR&lt;=10000{a[$1]++}END{for (i in a) printf \"%-6d %s\\n\",a[i], i|\"sort -n\"}' access.log<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>Print 
lines within a particular time range<\/p>\n<p>awk '\/01:05:\/,\/01:20:\/' access.log<br \/>\nSort access log by response size (increasing)<\/p>\n<p>awk --re-interval '{ match($0, \/(([^[:space:]]+|\\[[^\\]]+\\]|\"[^\"]+\")[[:space:]]+){7}\/, m); print m[2], $0 }' access.log | sort -nk 1<\/p>\n<p>View TCP connection status<br \/>\nnetstat -nat | awk '{print $6}' | sort | uniq -c | sort -rn<br \/>\nnetstat -n | awk '\/^tcp\/ {++S[$NF]}; END {for (a in S) print a, S[a]}'<br \/>\nnetstat -n | awk '\/^tcp\/ {++state[$NF]}; END {for (key in state) print key, \"\\t\", state[key]}'<br \/>\nnetstat -n | awk '\/^tcp\/ {++arr[$NF]}; END {for (k in arr) print k, \"\\t\", arr[k]}'<br \/>\nnetstat -n | awk '\/^tcp\/ {print $NF}' | sort | uniq -c | sort -rn<br \/>\nnetstat -ant | awk '{print $NF}' | grep -v '[a-z]' | sort | uniq -c<br \/>\nnetstat -ant | awk '\/ip:80\/ {split($5, ip, \":\"); ++S[ip[1]]} END {for (a in S) print S[a], a}' | sort -n<br \/>\nnetstat -ant | awk '\/:80\/ {split($5, ip, \":\"); ++S[ip[1]]} END {for (a in S) print S[a], a}' | sort -rn | head -n 10<br \/>\nawk 'BEGIN {printf(\"http_code\\tcount_num\\n\")} {COUNT[$10]++} END {for (a in COUNT) printf a \"\\t\\t\" COUNT[a] \"\\n\"}'<br \/>\nFind the top 20 IPs by number of requests (commonly used to find the source of an attack):<br \/>\nnetstat -anlp | grep 80 | grep tcp | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | head -n20<br \/>\nnetstat -ant | awk '\/:80\/ {split($5, ip, \":\"); ++A[ip[1]]} END {for (i in A) print A[i], i}' | sort -rn | head -n20<br \/>\nUse tcpdump to sniff port 80 access and see which IP sends the most requests:<br \/>\ntcpdump -i eth0 -tnn dst port 80 -c 1000 | awk -F\".\" '{print $1 \".\" $2 \".\" $3 \".\" $4}' | sort | uniq -c | sort -nr | head -20<br \/>\n4. Find connections stuck in TIME_WAIT:<br \/>\nnetstat -n | grep TIME_WAIT | awk '{print $5}' | sort | uniq -c | sort -rn | head -n20<br \/>\n5. Find SYN connections:<br \/>\nnetstat -an | grep SYN | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | more<br \/>\n6. Find the process that owns a port:<br \/>\nnetstat -ntlp | grep 80 | awk '{print $7}' | cut -d\/ -f1<br \/>\nWeb logs (Apache):<br \/>\n1. Top 10 IP addresses by number of requests:<br \/>\ncat access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10<br \/>\ncat access.log | awk '{counts[$(11)] += 1}; END {for (url in counts) print counts[url], url}'<br \/>\n2. Most visited files or pages (top 20), and the number of distinct IPs:<br \/>\ncat access.log | awk '{print $11}' | sort | uniq -c | sort -nr | head -20<br \/>\nawk '{print $1}' access.log | sort -n -r | uniq -c | wc -l<br \/>\n3. List transferred .exe files (commonly used when analysing a download site):<br \/>\ncat access.log | awk '($7 ~ \/\\.exe\/) {print $10 \" \" $1 \" \" $4 \" \" $7}' | sort -nr | head -20<br \/>\nList .exe files larger than 200000 bytes (about 200kb), with their number of occurrences:<br \/>\ncat access.log | awk '($10 &gt; 200000 &amp;&amp; $7 ~ \/\\.exe\/) {print $7}' | sort -n | uniq -c | sort -nr | head -100<br \/>\nIf the log records the page transfer time, list the pages most time-consuming for the client:<br \/>\ncat access.log | awk '($7 ~ \/\\.php\/) {print $NF \" \" $1 \" \" $4 \" \" $7}' | sort -nr | head -100<br \/>\n6. List the most time-consuming pages (more than 60 seconds) and their frequency:<br \/>\ncat access.log | awk '($NF &gt; 60 &amp;&amp; $7 ~ \/\\.php\/) {print $7}' | sort -n | uniq -c | sort -nr | head -100<br \/>\nList files with a transmission time over 30 seconds:<br \/>\ncat access.log | awk '($NF &gt; 30) {print $7}' | sort -n | uniq -c | sort -nr | head -20<br \/>\nWebsite traffic statistics (GB):<br \/>\ncat access.log | awk '{sum += $10} END {print sum\/1024\/1024\/1024}'<br \/>\n9. 404 statistics:<br \/>\nawk '($9 ~ \/404\/)' access.log | awk '{print $9, $7}' | sort<br \/>\n10. HTTP status statistics:<br \/>\ncat access.log | awk '{counts[$(9)] += 1}; END {for (code in counts) print code, counts[code]}'<br \/>\ncat access.log | awk '{print $9}' | sort | uniq -c | sort -rn<br \/>\n11. Requests per second:<br \/>\nawk '{if ($9 ~ \/200|30|404\/) COUNT[$4]++} END {for (a in COUNT) print a, COUNT[a]}' access.log | sort -k 2 -nr | head -n10<br \/>\n12. Bandwidth statistics:<br \/>\ncat apache.log | awk '{if ($7 ~ \/GET\/) count++} END {print \"client_request=\" count}'<br \/>\ncat apache.log | awk '{BYTE += $11} END {print \"client_kbyte_out=\" BYTE\/1024 \"KB\"}'<br \/>\n13. Number of objects and average object size:<br \/>\ncat access.log | awk '{byte += $10} END {print byte\/NR\/1024, NR}'<br \/>\ncat access.log | awk '{if ($9 ~ \/200|30\/) COUNT[$NF]++} END {for (a in COUNT) print a, COUNT[a], NR, COUNT[a]\/NR*100 \"%\"}'<br \/>\n14. Extract a 5-minute slice of the log:<br \/>\nif [ $DATE_MINUTE != $DATE_END_MINUTE ]; then # if the start and end timestamps differ, look up their line numbers<br \/>\nSTART_LINE=`sed -n \"\/$DATE_MINUTE\/=\" $APACHE_LOG | head -n1`<br \/>\nEND_LINE=`sed -n \"\/$DATE_END_MINUTE\/=\" $APACHE_LOG | head -n1`<br \/>\nsed -n \"${START_LINE},${END_LINE}p\" $APACHE_LOG &gt; $MINUTE_LOG # extract those lines into a temporary file<br \/>\nGET_START_TIME=`sed -n \"${START_LINE}p\" $APACHE_LOG | awk -F'[' '{print $2}' | awk '{print $1}' | sed 's#\/##g' | sed 's#:##'` # start timestamp found by line number<br \/>\nGET_END_TIME=`sed -n \"${END_LINE}p\" $APACHE_LOG | awk -F'[' '{print $2}' | awk '{print $1}' | sed 's#\/##g' | sed 's#:##'` # end timestamp found by line number<br \/>\n15. Spider analysis: see which spiders are fetching content:<br \/>\n\/usr\/sbin\/tcpdump -i eth0 -l -s 0 -w - dst port 80 | strings | grep -i user-agent | grep -i -E 'bot|crawler|slurp|spider'<br \/>\nSquid log analysis: traffic per domain:<br \/>\nzcat squid_access.log.tar.gz | awk '{print $10, $7}' | awk 'BEGIN {FS=\"[\/]\"} {trfc[$4] += $1} END {for (domain in trfc) {printf \"%s\\t%d\\n\", domain, trfc[domain]}}'<br \/>\nDatabase queries: watch SQL statements on the wire:<br \/>\n\/usr\/sbin\/tcpdump -i eth0 -s 0 -l -w - dst port 3306 | strings | egrep -i 'SELECT|UPDATE|DELETE|INSERT|SET|COMMIT|ROLLBACK|
CREATE|DROP|ALTER|CALL'<br \/>\nSystem debugging:<br \/>\nstrace -p PID # trace the specified process<br \/>\ngdb -p PID # attach a debugger to the specified process<\/p>\n<p>CHECKING FOR HIGH VISITS FROM A LIMITED NUMBER OF IPS<\/p>\n<p>First locate the log file for your site. The generic log is generally at \/var\/log\/httpd\/access_log or \/var\/log\/apache2\/access_log (depending on your distro). For virtualhost-specific logs, check the conf files or (if you have one active site and others in the background) run ls -alt \/var\/log\/httpd to see which file is most recently updated.<\/p>\n<p>1. Check out total unique visitors:-<br \/>\ncat access.log | awk '{print $1}' | sort | uniq -c | wc -l<\/p>\n<p>2. Check out unique visitors today:-<br \/>\ncat access.log | grep `date '+%e\/%b\/%Y'` | awk '{print $1}' | sort | uniq -c | wc -l<\/p>\n<p>3. Check out unique visitors this month:-<br \/>\ncat access.log | grep `date '+%b\/%Y'` | awk '{print $1}' | sort | uniq -c | wc -l<\/p>\n<p>4. Check out unique visitors on an arbitrary date:-<br \/>\ncat access.log | grep 22\/Mar\/2013 | awk '{print $1}' | sort | uniq -c | wc -l<\/p>\n<p>5. Check out unique visitors for the month of March:-<br \/>\ncat access.log | grep Mar\/2013 | awk '{print $1}' | sort | uniq -c | wc -l<\/p>\n<p>6. Check out the number of visits\/requests per visitor IP:-<br \/>\ncat access.log | awk '{print \"requests from \" $1}' | sort | uniq -c | sort<\/p>\n<p>7. Check out the number of visits\/requests for a given date:-<br \/>\ncat access.log | grep 26\/Mar\/2013 | awk '{print \"requests from \" $1}' | sort | uniq -c | sort<\/p>\n<p>8. Find out the top requesting IPs in the last 5,000 hits:<\/p>\n<p>tail -5000 access.log | awk '{print $1}' | sort | uniq -c | sort -n<\/p>\n<p>9. 
<p>9. Finally, if you have a ton of domains you may want to use this to aggregate them:<br />
for k in `ls --color=none`; do echo "Top visitors by IP for: $k"; awk '{print $1}' ~/logs/$k/http/access.log | sort | uniq -c | sort -n | tail; done</p>
<p>10. This command is great if you want to see what is being called the most (it can often show you that a specific script is being abused when it is called far more often than anything else on the site):<br />
awk '{print $7}' access.log | cut -d? -f1 | sort | uniq -c | sort -nk1 | tail -n10</p>
<p>11. If you have multiple domains on a PS (PS only!), run this command to get all traffic for all domains on the PS:<br />
for k in `ls -S /home/*/logs/*/http/access.log`; do wc -l $k | sort -r -n; done</p>
<p>12. Here is an alternative to the above command which does the same thing; this one is for a VPS only, using an admin user:<br />
sudo find /home/*/logs -type f -name "access.log" -exec wc -l "{}" \; | sort -r -n</p>
<p>13. If you're on a shared server you can run this command, which does the same as the one above but only for the domains in your logs directory. You have to run it while you're in your user's logs directory:<br />
for k in `ls -S */http/access.log`; do wc -l $k | sort -r -n; done</p>
<p>14. Grep the Apache access.log and list IPs by hits and date:<br />
grep Mar/2013 /var/log/apache2/access.log | awk '{print $1}' | sort -n | uniq -c | sort -rn | head</p>
<p>15. Find the top referring URLs:<br />
cut -d'"' -f4 /var/log/apache/access.log | grep -v '^-$' | grep -v '^http://www.your-site.com' | sort | uniq -c | sort -rg</p>
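Command 15 relies on a handy property of the combined log format: splitting a line on double quotes puts the request in field 2, the status and size in field 3, and the referer in field 4. A sketch against a hypothetical three-line log:

```shell
# Hypothetical combined-format entries; the referer is the 4th '"'-delimited field.
cat > /tmp/refs.log <<'EOF'
1.1.1.1 - - [26/Mar/2013:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "http://example.org/page" "Mozilla"
2.2.2.2 - - [26/Mar/2013:10:00:02 +0000] "GET /a HTTP/1.1" 200 128 "-" "Mozilla"
3.3.3.3 - - [26/Mar/2013:10:00:03 +0000] "GET /b HTTP/1.1" 200 64 "http://example.org/page" "Mozilla"
EOF

# Drop empty referers ("-"), count the rest, and keep the most frequent one.
top_ref=$(cut -d'"' -f4 /tmp/refs.log | grep -v '^-$' \
  | sort | uniq -c | sort -rg | head -n1 | awk '{print $2}')
echo "$top_ref"   # http://example.org/page
```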
<p>16. Check out the top 'Page Not Found' (404) targets:<br />
cut -d'"' -f2,3 /var/log/apache/access.log | awk '$4==404{print $4" "$2}' | sort | uniq -c | sort -rg</p>
<p>17. Check out the largest images:<br />
cut -d'"' -f2,3 /var/log/apache/access.log | grep -E '\.jpg|\.png|\.gif' | awk '{print $5" "$2}' | sort | uniq | sort -rg</p>
<p>18. Check out server response codes:<br />
cut -d'"' -f3 /var/log/apache/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg</p>
<p>19. Check out Apache requests per day:<br />
awk '{print $4}' access.log | cut -d: -f1 | uniq -c</p>
<p>20. Check out Apache requests per hour:<br />
grep "6/May" access.log | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c</p>
<p>21. Check out Apache requests per minute:<br />
grep "6/May/2013:06" access.log | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":"$3}' | sort -nk1 -nk2 | uniq -c | awk '{ if ($1 > 10) print $0}'</p>
<p>22. All of these commands can just as easily be run on a log-rotated file:<br />
zcat /var/log/apache/access.log.1.gz | cut -d'"' -f3 | cut -d' ' -f2 | sort | uniq -c | sort -rg</p>
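Commands 19-21 all work the same way: chop the timestamp field down to the granularity you care about, then count duplicates. A sketch of the per-hour variant, using hypothetical sample entries:

```shell
# Three hypothetical hits: two in the 06:00 hour, one in the 07:00 hour.
cat > /tmp/hours.log <<'EOF'
9.9.9.9 - - [06/May/2013:06:25:01 +0000] "GET / HTTP/1.1" 200 100
9.9.9.9 - - [06/May/2013:06:59:59 +0000] "GET / HTTP/1.1" 200 100
9.9.9.9 - - [06/May/2013:07:00:00 +0000] "GET / HTTP/1.1" 200 100
EOF

# Strip everything around the timestamp, then keep only the hour:
# "06/May/2013:06:25:01" split on ':' puts the hour in field 2.
hits_06=$(grep "6/May" /tmp/hours.log | cut -d[ -f2 | cut -d] -f1 \
  | awk -F: '{print $2 ":00"}' | sort -n | uniq -c \
  | awk '$2 == "06:00" {print $1}')
echo "$hits_06"   # 2
```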
<p>Example: a collection of grep/awk log-analysis one-liners</p>
<p>1. Analyze which pages were accessed on 2012-05-04: list the top 20 URLs, sorted:<br />
cat access.log | grep '04/May/2012' | awk '{print $11}' | sort | uniq -c | sort -nr | head -20</p>
<p>Find the IPs whose referring URL contains www.abc.com:<br />
cat access_log | awk '($11 ~ /www.abc.com/){print $1}' | sort | uniq -c | sort -nr</p>
<p>2. Get the top 10 visiting IP addresses (this can also be restricted by time):<br />
cat linewow-access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10</p>
<p>Top 10 visiting IPs:<br />
cat access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10<br />
cat access.log | awk '{counts[$11]+=1} END{for(url in counts) print counts[url], url}'</p>
<p>Most-visited files or pages (top 20), plus a count of all visiting IPs:<br />
cat access.log | awk '{print $11}' | sort | uniq -c | sort -nr | head -20<br />
awk '{print $1}' access.log | sort -n -r | uniq -c | wc -l</p>
<p>Query the log for a given time window:<br />
cat wangsu.log | egrep '06/Sep/2012:14:35|06/Sep/2012:15:05' | awk '{print $1}' | sort | uniq -c | sort -nr | head -10</p>
<p>3. List the largest .exe transfers (common when analyzing a download site):<br />
cat access.log | awk '($7 ~ /\.exe/){print $10 " " $1 " " $4 " " $7}' | sort -nr | head -20</p>
<p>4. List .exe files larger than 200000 bytes (about 200 KB) and how often each was requested:<br />
cat access.log | awk '($10 > 200000 && $7 ~ /\.exe/){print $7}' | sort -n | uniq -c | sort -nr | head -100</p>
<p>5. If the last field of each record is the page transfer time, list the pages that take the longest to reach the client:<br />
cat access.log | awk '($7 ~ /\.php/){print $NF " " $1 " " $4 " " $7}' | sort -nr | head -100</p>
<p>6. List the most time-consuming pages (more than 60 seconds) and how often each occurs:<br />
cat access.log | awk '($NF > 60 && $7 ~ /\.php/){print $7}' | sort -n | uniq -c | sort -nr | head -100</p>
<p>7. List transfers that took longer than 30 seconds:<br />
cat access.log | awk '($NF > 30){print $7}' | sort -n | uniq -c | sort -nr | head -20</p>
<p>8. Total website traffic (in GB):<br />
cat access.log | awk '{sum+=$10} END {print sum/1024/1024/1024}'</p>
<p>9. List 404 responses:<br />
awk '($9 ~ /404/)' access.log | awk '{print $9,$7}' | sort</p>
<p>10. HTTP status statistics:<br />
cat access.log | awk '{counts[$9]+=1} END{for(code in counts) print code, counts[code]}'<br />
cat access.log | awk '{print $9}' | sort | uniq -c | sort -rn</p>
<p>11. Concurrency per second:<br />
awk '{if ($9 ~ /200|30|404/) COUNT[$4]++} END {for (a in COUNT) print a, COUNT[a]}' access.log | sort -k2 -nr | head -n10</p>
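The traffic total in item 8 is just a running sum over the response-size field ($10 in the combined format), converted from bytes in the END block. A sketch with two hypothetical 1 MB responses:

```shell
# Two hypothetical hits of exactly 1 MiB (1048576 bytes) each.
cat > /tmp/bytes.log <<'EOF'
1.1.1.1 - - [20/Mar/2011:10:00:01 +0000] "GET /a HTTP/1.1" 200 1048576
1.1.1.1 - - [20/Mar/2011:10:00:02 +0000] "GET /b HTTP/1.1" 200 1048576
EOF

# Accumulate the size field, then convert bytes -> MB once at the end.
mb=$(awk '{ sum += $10 } END { print sum / 1024 / 1024 }' /tmp/bytes.log)
echo "$mb"   # 2
```

Dividing once more by 1024, as the original one-liner does, reports GB instead of MB.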
<p>12. Bandwidth statistics:<br />
cat apache.log | awk '{if ($7 ~ /GET/) count++} END {print "client_request=" count}'<br />
cat apache.log | awk '{BYTE+=$11} END {print "client_kbyte_out=" BYTE/1024 "KB"}'</p>
<p>The 10 most frequent visiting IPs on a given day:<br />
cat /tmp/access.log | grep "20/Mar/2011" | awk '{print $3}' | sort | uniq -c | sort -nr | head</p>
<p>What the IP with the most connections that day is requesting:<br />
cat access.log | grep "10.0.21.17" | awk '{print $8}' | sort | uniq -c | sort -nr | head -n 10</p>
<p>Find the most-visited minutes (the timestamp is field $4; characters 14-18 are the HH:MM part):<br />
awk '{print $4}' access.log | grep "20/Mar/2011" | cut -c 14-18 | sort | uniq -c | sort -nr | head</p>
<p>Attachment: view TCP connection states:<br />
netstat -nat | awk '{print $6}' | sort | uniq -c | sort -rn<br />
netstat -n | awk '/^tcp/ {++S[$NF]} END {for (a in S) print a, S[a]}'<br />
netstat -n | awk '/^tcp/ {++state[$NF]} END {for (key in state) print key, "\t", state[key]}'<br />
netstat -n | awk '/^tcp/ {++arr[$NF]} END {for (k in arr) print k, "\t", arr[k]}'<br />
netstat -n | awk '/^tcp/ {print $NF}' | sort | uniq -c | sort -rn<br />
netstat -ant | awk '{print $NF}' | grep -v '[a-z]' | sort | uniq -c<br />
netstat -ant | awk '/ip:80/ {split($5, ip, ":"); ++S[ip[1]]} END {for (a in S) print S[a], a}' | sort -n<br />
netstat -ant | awk '/:80/ {split($5, ip, ":"); ++S[ip[1]]} END {for (a in S) print S[a], a}' | sort -rn | head -n 10<br />
awk 'BEGIN {printf("http_code\tcount_num\n")} {COUNT[$10]++} END {for (a in COUNT) printf a "\t\t" COUNT[a] "\n"}' access.log</p>
<p>2. Find the top 20 requesting IPs (commonly used to locate an attack source):<br />
netstat -anlp | grep 80 | grep tcp | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | head -n20<br />
netstat -ant | awk '/:80/ {split($5, ip, ":"); ++A[ip[1]]} END {for (i in A) print A[i], i}' | sort -rn | head -n20</p>
<p>3. Sniff port 80 with tcpdump to see who generates the most hits:<br />
tcpdump -i eth0 -tnn dst port 80 -c 1000 | awk -F"." '{print $1"."$2"."$3"."$4}' | sort | uniq -c | sort -nr | head -20</p>
<p>4. Find excessive TIME_WAIT connections:<br />
netstat -n | grep TIME_WAIT | awk '{print $5}' | sort | uniq -c | sort -rn | head -n20</p>
<p>5. Hunt down excessive SYN connections:<br />
netstat -an | grep SYN | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | more</p>
<p>6. List the process behind a port:<br />
netstat -ntlp | grep 80 | awk '{print $7}' | cut -d/ -f1</p>
","protected":false},"excerpt":{"rendered":"<p>Since I (and you as a visitor) don\u2019t want your IP-address to be spread around the internet, I\u2019ve anonymized the log data. It\u2019s a fairly easy process that is done in 2 steps: IP\u2019s are translated into random values. Admin url\u2019s are removed. 
Step 1: Translating IP\u2019s <\/p>\n<p>All the IP\u2019s are translated into random IP\u2019s, [&#8230;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[],"_links":{"self":[{"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/posts\/2786"}],"collection":[{"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mohan.sg\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2786"}],"version-history":[{"count":7,"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/posts\/2786\/revisions"}],"predecessor-version":[{"id":3379,"href":"https:\/\/mohan.sg\/index.php?rest_route=\/wp\/v2\/posts\/2786\/revisions\/3379"}],"wp:attachment":[{"href":"https:\/\/mohan.sg\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2786"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mohan.sg\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2786"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mohan.sg\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2786"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}