- 16. Sep
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
Presto has a few basic requirements:
- Linux or Mac OS X
- Java 8, 64-bit
- Python 2.4+
WHAT CAN IT DO?
Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
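As an illustration of one query spanning several sources, the sketch below assembles a join between a Hive table and a MySQL table using Presto's `catalog.schema.table` naming. The catalog, schema, table, and column names (`hive.web.page_views`, `mysql.crm.customers`, `customer_id`, and so on) are hypothetical, and the helper only builds the SQL text. Running the query would require a Presto cluster with those catalogs configured.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query(views_table, customers_table, limit=10):
    """Build SQL that joins two tables living in different catalogs.

    All table and column names used here are hypothetical placeholders.
    """
    return (
        "SELECT c.name, count(*) AS views\n"
        f"FROM {views_table} AS v\n"
        f"JOIN {customers_table} AS c ON v.customer_id = c.id\n"
        "GROUP BY c.name\n"
        "ORDER BY views DESC\n"
        f"LIMIT {int(limit)}"
    )


query = build_federated_query(
    qualified("hive", "web", "page_views"),   # assumed Hive catalog
    qualified("mysql", "crm", "customers"),   # assumed MySQL catalog
)
print(query)
```

The resulting text could be pasted into any Presto client. The point is only that both tables appear in the same statement, each addressed through its own catalog.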
Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow "free" solution that requires excessive hardware.
WHO USES IT?
Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each day.
Leading internet companies including Airbnb and Dropbox are using Presto.
Presto is amazing. Lead engineer Andy Kramolisch got it into production in just a few days. It's an order of magnitude faster than Hive in most of our use cases. It reads directly from HDFS, so unlike Redshift, there isn't a lot of ETL before you can use it. It just works.
-Christopher Gutierrez, Manager of Online Analytics, Airbnb
We're really excited about Presto. We're planning on using it to quickly gain insight about the different ways our users use Dropbox, as well as diagnosing problems they encounter along the way. In our tests so far it's been rock solid and extremely fast when applied to some of our most important ad hoc use cases.
-Fred Wulff, Software Engineer, Dropbox
Queries are running slower than expected. What are the factors that influence Presto performance?
- The first things to check are the basic machine stats for your workers and coordinators. Measure the load, network, and disk utilization over time to understand where Presto is running out of resources.
- If the Presto process is mostly idle, Presto cannot retrieve data fast enough from the HDFS data nodes. This could be caused by limited network or disk bandwidth, or by a CPU bottleneck on the data nodes.
- If the Presto process is using 100% CPU, the cause might be an input format that is expensive to parse. For example, TextFile is a very expensive input format to parse.
- If neither of the above is true, the Presto process may be suffering from some form of internal resource starvation. In that case, take a thread dump of your coordinators and workers with a tool like jstack as a starting point for your investigation.
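
The checklist above can be read as a small decision procedure. The following is a minimal sketch of it, assuming you already collect the Presto process's CPU utilization per worker. The function name `triage` and the 20% and 90% thresholds are illustrative assumptions, not Presto settings.

```python
def triage(presto_cpu_pct, idle_below=20.0, busy_above=90.0):
    """Map a Presto process's CPU utilization (0-100) to a next step.

    The thresholds are illustrative assumptions, not Presto defaults.
    """
    if not 0.0 <= presto_cpu_pct <= 100.0:
        raise ValueError("CPU utilization must be between 0 and 100")
    if presto_cpu_pct < idle_below:
        # Presto is waiting on data: look at the storage side.
        return "mostly idle: check network, disk, and CPU on the HDFS data nodes"
    if presto_cpu_pct > busy_above:
        # Presto is saturated: parsing cost is a common culprit.
        return "CPU bound: check for an expensive input format such as TextFile"
    # Neither extreme: suspect internal resource starvation.
    return "neither: take thread dumps of coordinators and workers with jstack"


for pct in (5.0, 99.0, 50.0):
    print(pct, "->", triage(pct))
```

In practice the input would come from whatever monitoring you already run over time. A single sample is rarely enough to tell an idle process from a momentary pause.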