Most importantly, store your data in an optimal storage format.
Avoid text, JSON, and even sequence files when possible.
Ideally use RCFile. The RCFile (Record Columnar File) format is optimised to utilise Hadoop’s parallelism by storing data in row groups which are then organised into columnar sections. The row groups are stored (one or more per file) redundantly on HDFS, so the same data can be processed by several nodes. The columnar structure improves compression (similar data is stored together) and allows queries to skip columns that are irrelevant to them (reducing disk I/O).
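As a sketch, you choose the format when creating the table; the table and column names here are illustrative:

```sql
-- Store the table in RCFile format; Hive handles the row-group
-- and columnar layout transparently.
CREATE TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time STRING
)
STORED AS RCFILE;
```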
Generally, use block compression over value compression since it is more efficient (see the mapred.output.compression.type setting). Try to use splittable-friendly compression codecs provided by Hadoop and Hive, like Snappy (see the mapred.output.compression.codec setting) – note that Snappy is splittable when used inside container formats such as SequenceFile or RCFile. Avoid Gzip, which is not splittable and CPU intensive, unless you ingest pre-split Gzipped files from a secondary process.
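These can be set per session, for example:

```sql
-- Compress whole blocks rather than individual values
SET mapred.output.compression.type=BLOCK;
-- Use the Snappy codec shipped with Hadoop
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```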
Ensure your final Hive output is always compressed:
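For example:

```sql
-- Compress the final output of Hive queries
SET hive.exec.compress.output=true;
```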
Also compress your intermediate output:
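```sql
-- Compress the data passed between stages of multi-job queries
SET hive.exec.compress.intermediate=true;
```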
Hive supports S3 as a storage layer. Use it for backups, not as your main read/write storage, since every read and write then crosses the network. HDFS is the better place to store your working data and gives Hadoop a chance to optimise data access.
Partition large tables
Partition tables if you collect time-series data or logs that accumulate over time and you only need to query parts of them. You can store the data in a subdirectory tree like year/month/day or, for geographic data, continent/country/region/city and so forth. That allows your queries to skip the irrelevant data.
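A minimal sketch (table and column names are illustrative): partition columns become subdirectories like .../year=2013/month=06/day=01, and a query that filters on them only reads that subtree.

```sql
-- Partition columns are virtual: they are not stored in the data files,
-- they map to the directory layout.
CREATE TABLE logs (
  ts   STRING,
  line STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS RCFILE;

-- Only the year=2013/month=06 partitions are scanned
SELECT line FROM logs WHERE year = '2013' AND month = '06';
```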
Use small tables on the left side of your joins and enable auto optimisation:
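The auto optimisation is a single switch:

```sql
-- Let Hive convert eligible joins into map-side joins automatically
SET hive.auto.convert.join=true;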
This will cache small enough tables and allow mapper-only joins, which are much faster. You can also tweak the size of what Hive considers small enough to cache:
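```sql
-- Tables below this size (in bytes) are candidates for map-side joins;
-- 25000000 is the default, raise it with care.
SET hive.mapjoin.smalltable.filesize=25000000;
```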
Make sure your JVM has enough memory though.
Avoid Order By
One of the slowest things you can do is order a large output. Order by is global and not yet well optimised in Hive, so it results in a single final reducer sorting all the data. If you can, use sort by or cluster by instead, which organise data on a per-reducer (local) level and are much faster.
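Side by side (the logs table is illustrative):

```sql
-- order by: one reducer sorts everything, globally ordered but slow
SELECT * FROM logs ORDER BY ts;

-- sort by: each reducer sorts only its own output, fast but locally ordered
SELECT * FROM logs SORT BY ts;

-- cluster by: distribute and sort by the same column per reducer
SELECT * FROM logs CLUSTER BY ts;
```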
Lastly, optimise Hadoop itself and keep an eye on the jobtracker and the logs of your MapReduce jobs for things like spills. You can increase your JVM and sort cache sizes, raise the min and max split sizes to ensure that you don’t create countless tiny map tasks, and so forth.
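For example, the split sizes can be raised per session; the byte values below are illustrative and should be tuned for your cluster:

```sql
-- Avoid countless tiny map tasks by forcing larger splits
SET mapred.min.split.size=268435456;   -- 256 MB
SET mapred.max.split.size=1073741824;  -- 1 GB
```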