Use AWStats with Amazon S3 / CloudFront

I’ve recently written two posts about AWS and log processing with an external server so here comes a final post to wrap it all up and tie it together.

Goal

To have an automated cron job that pulls Amazon S3 and/or CloudFront log files down to your own server and then uses logresolvemerge to combine them for processing with AWStats.

Requirements

An Amazon Web Services account where your log files are collected in a bucket. Your own dedicated server, VPS or host with shell access, so that you can install Python and boto and add your own scripts. AWStats should also be installed and running.
I’ve made this setup using Ubuntu 10.04, which I run on a VPS over at Linode where I currently host a couple of projects.

Setup

I’m currently collecting logs from one CloudFront distribution and one S3 bucket, both of which I have mapped my own CNAMEs to. Let’s call them s3.example.com and cdn.example.com. I also have another bucket, which is not public, where I store other things. In this bucket I’ve made a folder named logs, and in that folder I have one folder for each domain.
So the log prefixes in my case become:
/logs/s3.example.com
/logs/cdn.example.com

Download AWS Logs

To be able to download the log files from Amazon to a local directory, check out my earlier post about downloading AWS logs with boto and Python.
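
For readers who don’t have that post at hand, the download step can be sketched roughly like this. It is not the get-aws-logs.py script itself, just a minimal illustration that assumes boto 2’s S3 interface (bucket.list(prefix=...) and key.get_contents_to_filename(...)). The function name download_logs and the bucket name in the docstring are made up for the example, and whether the remote log files should be deleted after download is left out on purpose.

```python
import os


def download_logs(bucket, prefix, local_dir):
    """Copy every key under `prefix` in `bucket` into `local_dir`.

    `bucket` is assumed to be a boto 2 Bucket, for example one obtained with
    boto.connect_s3().get_bucket("my-log-bucket") (hypothetical bucket name).
    Returns the list of local file paths that were written.
    """
    if not os.path.isdir(local_dir):
        os.makedirs(local_dir)
    saved = []
    for key in bucket.list(prefix=prefix):
        filename = os.path.basename(key.name)
        if not filename:
            # Skip "folder" placeholder keys such as "logs/cdn.example.com/".
            continue
        # os.path.join copes with a missing trailing slash on local_dir.
        target = os.path.join(local_dir, filename)
        key.get_contents_to_filename(target)
        saved.append(target)
    return saved
```

With a real connection this would be called once per prefix, for example download_logs(bucket, "logs/cdn.example.com/", "/tmp/log_cdn/").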

Configure AWStats

Now we need to set up some AWStats configuration to prepare it for handling the AWS logs. I’ll create two configuration files for this example:
/etc/awstats/awstats.cdn.example.com.conf
/etc/awstats/awstats.s3.example.com.conf

I assume that you already know your way around configuring AWStats, so I’ll focus on the specifics for AWS compatibility. The final log files that will be created later on for AWStats to use will be stored in /var/log/apache2/, so I point the LogFile option to that location. Then we just have to set up the LogFormat correctly. Below is the setup for S3 log files and then for CloudFront log files.

S3 AWStats LogFormat

LogFile="/var/log/apache2/s3.example.com.log"
LogFormat="%other %extra1 %time1 %host %logname %other %method %url %otherquot %code %extra2 %bytesd %other %extra3 %extra4 %refererquot %uaquot %other"
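
To see how that LogFormat lines up with an actual log entry, here is a small sketch that pairs each field of an S3 access log line with its AWStats token. The sample line is made up, the names S3_LOGFORMAT, FIELD_RE and map_s3_line are only for illustration, and it assumes the 18-field layout the LogFormat above describes (newer S3 logs may carry extra trailing fields, which this does not handle).

```python
import re

# AWStats tokens from the LogFormat above, in order.
S3_LOGFORMAT = (
    "%other %extra1 %time1 %host %logname %other %method %url %otherquot "
    "%code %extra2 %bytesd %other %extra3 %extra4 %refererquot %uaquot %other"
).split()

# One field is either [bracketed], "quoted", or a run of non-space characters.
FIELD_RE = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def map_s3_line(line):
    """Pair each field of an S3 access log line with its LogFormat token."""
    fields = FIELD_RE.findall(line)
    return list(zip(S3_LOGFORMAT, fields))


# Made-up sample line, with the operation already rewritten to a plain "GET".
sample = (
    'ownerid s3.example.com [29/Nov/2010:04:43:00 +0000] 192.0.2.10 '
    '- REQUESTID GET images/logo.png "GET /images/logo.png HTTP/1.1" '
    '200 - 1234 1234 15 14 "http://example.com/" "Mozilla/5.0" -'
)

for token, value in map_s3_line(sample):
    print(token, value)
```

If the number of fields and tokens differ, AWStats will typically reject the line as not matching the format, so a quick check like this can save some guessing.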

CloudFront AWStats LogFormat

LogFile="/var/log/apache2/cdn.example.com.log"
LogFormat="%time2 %cluster %bytesd %host %method %virtualname %url %code %referer %ua %query"
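
The same kind of check can be done for CloudFront. The sketch below assumes the tab-separated, 12-column layout that the LogFormat above corresponds to, where the date and time columns together make up %time2. The sample values and the names CLOUDFRONT_LOGFORMAT and map_cloudfront_line are invented for the example, and newer CloudFront logs with additional columns are not covered.

```python
# AWStats tokens from the CloudFront LogFormat above, in order.
CLOUDFRONT_LOGFORMAT = (
    "%time2 %cluster %bytesd %host %method %virtualname "
    "%url %code %referer %ua %query"
).split()


def map_cloudfront_line(line):
    """Pair a tab-separated CloudFront log line with the LogFormat tokens.

    Returns None for header lines, which start with "#".
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    # %time2 covers both the date column and the time column.
    merged = [fields[0] + " " + fields[1]] + fields[2:]
    return list(zip(CLOUDFRONT_LOGFORMAT, merged))


# Made-up sample in the 12-column layout the LogFormat assumes.
sample = "\t".join([
    "2010-11-29", "04:43:00", "AMS1", "1234", "192.0.2.10", "GET",
    "d111111abcdef8.cloudfront.net", "/images/logo.png", "200",
    "http://example.com/", "Mozilla/5.0", "-",
])

for token, value in map_cloudfront_line(sample):
    print(token, value)
```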

If you also want the CloudFront statistics to display information about the edges you can check out my post about CloudFront Edges in AWStats.

Automate everything

Now that we have all the components in place, we just need to automate them so that we can later add the whole thing to cron. I’ve made a bash script which takes care of the automation. The script is not very complicated, but I’ll do a quick walkthrough of it so it can be modified for specific needs and setups. I added a number to the comment for each section in the script, which I use as a reference in the list below.

  1. A few variables used in the script. The date variable is just to collect the current date. I don’t really use this information at the moment other than appending it to a temporary directory name. But in case I want to expand on the script in the future to keep the archives around it could be handy. Then I create a variable for each log I want to process. In this example I process two logs, one S3 and one CloudFront, so I have 2 variables here containing paths to temp directories where the log files will be downloaded.
  2. Here we use the boto Python script I created earlier to download all log files from Amazon to our local temp directories.
  3. Now that all the log files have been downloaded, we need to combine them into a format that AWStats can understand. The first line combines the CloudFront logs. They are very straightforward, so they just need to be combined into one large file and AWStats is ready to process it. The second line processes the S3 log files into one large log file. S3 is a bit trickier, as its logs contain a few things AWStats doesn’t understand, so I use a regexp to rewrite the parts that would give AWStats a headache. I store my final AWS log files in /var/log/apache2/, which is the path I defined in the LogFile option for AWStats earlier.
  4. Our log files are now downloaded and combined into the final log files that are stored in /var/log/apache2/ so I simply delete the temporary downloaded files, as I don’t need to keep them around anymore.
  5. And finally we execute AWStats to update the statistics with the log files we have just processed.

#!/bin/bash
# Initial, cron script to download and merge AWS logs
# 29/11 - 2010, Johan Steen

# 1. Setup variables
date=$(date +%Y-%m-%d)
cdn_folder="/tmp/log_cdn_$date/"
static_folder="/tmp/log_static_$date/"

# 2. Call the python script to download log folders from Amazon to local folders
python /home/johan/get-aws-logs.py --prefix=logs/cdn.example.com/ --local="$cdn_folder"
python /home/johan/get-aws-logs.py --prefix=logs/s3.example.com/ --local="$static_folder"

# 3. Merge and add the downloaded log files to the local log file
/usr/local/bin/logresolvemerge.pl "$cdn_folder"* >> /var/log/apache2/cdn.example.com.log
/usr/local/bin/logresolvemerge.pl "$static_folder"* | sed -e 's/SOAP\.\([A-Z]*\)/\1/' -e 's/REST\.\([A-Z]*\)\.[A-Z]*/\1/' >> /var/log/apache2/s3.example.com.log

# 4. Delete the downloaded log files
rm -rf "$cdn_folder"
rm -rf "$static_folder"

# 5. Update the AWStats Logs
/usr/lib/cgi-bin/awstats.pl -config=cdn.example.com -update
/usr/lib/cgi-bin/awstats.pl -config=s3.example.com -update
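
To make the sed part of step 3 a bit less cryptic, here is a rough Python equivalent of the two expressions. It is only meant to show what the rewrite does to a line (the function name normalize_s3_operation and the sample line are made up), not to replace the sed call in the script.

```python
import re


def normalize_s3_operation(line):
    """Mimic the two sed expressions from step 3 on a single log line."""
    # sed -e 's/SOAP\.\([A-Z]*\)/\1/'  (first match only, as there is no g flag)
    line = re.sub(r"SOAP\.([A-Z]*)", r"\1", line, count=1)
    # sed -e 's/REST\.\([A-Z]*\)\.[A-Z]*/\1/'
    line = re.sub(r"REST\.([A-Z]*)\.[A-Z]*", r"\1", line, count=1)
    return line


print(normalize_s3_operation("REQUESTID REST.GET.OBJECT images/logo.png"))
# prints: REQUESTID GET images/logo.png
```

In other words, an S3 operation such as REST.GET.OBJECT is reduced to the plain HTTP method GET, which is what the %method token in the LogFormat expects.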

Cron it!

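And finally, add the bash script to cron and run it as often as you feel is appropriate for your setup. The line below is in the system crontab format (/etc/crontab or a file in /etc/cron.d), which includes a user field; if you add it with crontab -e instead, leave out root.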

# Process the AWS Logs at 4:43 every night
43 4 * * * root /home/johan/get-aws-logs.sh >/dev/null

And that’s it. Feel free to leave a comment if you have any questions or suggestions for improvements.

Category: Cloud Services

Comments

  1. Andy, March 2, 2011

    Thanks for the article. I was able to use your setup with minimal changes. A couple notes below.

    Because the S3 logs are in GMT, to display the stats in EST I added to my awstats config
    LoadPlugin="timezone -5"

    Also, because the AWS management console was producing log entries, I had to add to SkipHosts "REGEX[^10\.]" along with my external IP.

    With the get-aws-logs.py script, I kept forgetting the trailing slash in my local directory. To make it more forgiving, I modified each instance of
    self.LOCAL_PATH+filename
    to
    os.path.join(self.LOCAL_PATH, filename)

    1. Johan, March 3, 2011

      Andy,

      Thanks for chiming in and adding your changes. Very nice and much appreciated!

      Cheers,
      Johan

    2. Andy, October 5, 2011

      I was revisiting this with the goal of tracking a specific parameter in the request query strings. I had problems because the LogFormat being used did not support the %query option.

      To add this functionality, I changed my AWStats config from
      LogFormat="%other %extra1 %time1 %host %logname %other %method %url %otherquot %code %extra2 %bytesd %other %extra3 %extra4 %refererquot %uaquot %other"

      to
      LogFormat="%other %extra1 %time1 %host %logname %other %other %other %methodurl %code %extra2 %bytesd %other %extra3 %extra4 %refererquot %uaquot %other"

      This allowed me to use options like URLWithQuery, and the QUERY_STRING condition in an ExtraSection

  2. rk, May 7, 2011

    Thanks for the script. Would appreciate if you can provide LogFormat for cloudfront streaming logs.

  3. Imthiaz, May 18, 2011

    You are a genius. Thanks for sharing this wonderful script.

    I had some issues with the CloudFront logs. Since the logs use tab as the separator, AWStats did not recognize %time2.

    I made a change to awstats.conf
    LogSeparator="\t"

    And in the script

    logresolvemerge.pl ${static_folder}* | sed -r -e 's/([0-9]{4}-[0-9]{2}-[0-9]{2})\t([0-9]{2}:[0-9]{2}:[0-9]{2})/\1 \2/g' >> access.log

    Hope it helps someone.

    Thanks
    Imthiaz

  4. Colin, July 7, 2011

    I ran into an issue where I wanted to track image hits as pageviews, then found the solution was to hack awstats.pl: http://stackoverflow.com/questions/6603972/how-to-track-jpg-hits-as-page-views-in-awstats-7-0

