<t>I'm parsing access logs generated by Apache, Nginx, and Darwin (a video streaming server) and aggregating statistics for each delivered file by date, referrer, and user agent.<br/>
<br/>
A large volume of logs is generated every hour, and that volume is likely to increase dramatically in the near future, so processing this data in a distributed manner via Amazon Elastic MapReduce sounds reasonable.<br/>
<br/>
Right now I have mappers and reducers ready to process my data, and I have tested the whole process with the following flow:<br/>
<br/>
- uploaded mappers, reducers and data to Amazon S3<br/>
<br/>
- configured an appropriate job and ran it successfully<br/>
<br/>
- downloaded the aggregated results from Amazon S3 to my server and inserted them into a MySQL database by running a CLI script<br/>
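<br/>
For illustration, the last step of that flow can be sketched roughly as below. This is only a minimal sketch under assumptions, not my actual script: the tab-separated line layout (date, file, referrer, user agent, hit count), the `file_stats` table, and the `load_results` helper are all hypothetical, and `sqlite3` stands in for MySQL so the example is self-contained.<br/>
<br/>
```python
import sqlite3


def parse_line(line):
    """Split one reducer output line into (day, path, referrer, useragent, hits).

    Assumes a hypothetical tab-separated layout; adjust to the real reducer output.
    """
    day, path, referrer, useragent, hits = line.rstrip("\n").split("\t")
    return day, path, referrer, useragent, int(hits)


def load_results(lines, conn):
    """Insert aggregated rows into a hypothetical file_stats table.

    The primary key plus INSERT OR REPLACE means loading the same
    result file twice overwrites rows instead of duplicating them.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_stats ("
        "day TEXT, path TEXT, referrer TEXT, useragent TEXT, hits INTEGER, "
        "PRIMARY KEY (day, path, referrer, useragent))"
    )
    # Skip blank lines that may appear at the end of a part file.
    rows = [parse_line(line) for line in lines if line.strip()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO file_stats VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)
```

With MySQL, the equivalent of `INSERT OR REPLACE` would be something like `INSERT ... ON DUPLICATE KEY UPDATE`, so that re-running the import does not double-count results.<br/>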
<br/>
I've done that manually, following the many tutorials about Amazon EMR that can be found on the Internet.<br/>
<br/>
What should I do next? What is the best approach to automating this process?<br/>
<br/>
- Should I control the Amazon EMR JobTracker via the API?<br/>
<br/>
- How can I make sure my logs will not be processed twice?<br/>
<br/>
- What is the best way to move processed files to an archive?<br/>
<br/>
- What is the best approach to insert results into PostgreSQL/MySQL?<br/>
<br/>
- How should the data for the jobs be laid out in input/output directories?<br/>
<br/>
- Should I create a new EMR job each time using the API?<br/>
<br/>
- What is the best approach to upload raw logs to Amazon S3?<br/>
<br/>
- Can anyone share their setup of the data processing flow?<br/>
<br/>
- How can I track file uploads and job completions?<br/>
<br/>
I think this topic could be useful for many people who are trying to process access logs with Amazon Elastic MapReduce but have not been able to find good materials or best practices.<br/>
<br/>
UPD: Just to clarify, here is the single final question:<br/>
<br/>
What are the best practices for log processing powered by Amazon Elastic MapReduce?<br/>
<br/>
Related posts:<br/>
<br/>
Getting data in and out of Elastic MapReduce HDFS</t>