- Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL.
- Athena also makes it easy to interactively run data analytics using Apache Spark without planning for, configuring, or managing resources.
- When you run Apache Spark applications on Athena, you submit Spark code for processing and receive the results directly.
- Use the simplified notebook experience in the Amazon Athena console to develop Apache Spark applications using Python or Athena notebook APIs.
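The submit-and-receive flow for Spark code can be sketched with the Athena notebook APIs as exposed by boto3 (`start_calculation_execution` and `get_calculation_execution`). This is a minimal sketch, not a definitive implementation: the bucket, the `event_type` column, and the helper name `run_spark_calculation` are hypothetical, the snippet assumes a Spark-enabled session already exists, and it assumes the session predefines a `spark` variable.

```python
import time

# Hypothetical PySpark snippet; the S3 path and column name are placeholders.
# It assumes the Athena Spark session provides a ready-made `spark` object.
SPARK_CODE = """
df = spark.read.json("s3://example-bucket/events/")
df.groupBy("event_type").count().show()
"""


def run_spark_calculation(client, session_id, code, poll_seconds=1.0):
    """Submit a code block to an existing session and wait for it to finish.

    `client` is expected to behave like boto3.client("athena").
    Returns the S3 URI of the calculation's standard output, if reported.
    """
    started = client.start_calculation_execution(
        SessionId=session_id,
        CodeBlock=code,
    )
    calculation_id = started["CalculationExecutionId"]

    # Poll until the calculation reaches a terminal state.
    while True:
        info = client.get_calculation_execution(
            CalculationExecutionId=calculation_id
        )
        state = info["Status"]["State"]
        if state in ("COMPLETED", "FAILED", "CANCELED"):
            break
        time.sleep(poll_seconds)

    if state != "COMPLETED":
        raise RuntimeError(f"Calculation {calculation_id} ended in state {state}")
    return info.get("Result", {}).get("StdOutS3Uri")
```

With real credentials, `client` would be `boto3.client("athena")` and `session_id` would come from a session started in a Spark-enabled workgroup. Passing the client in as a parameter keeps the helper easy to exercise with a stub.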
- Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3.
- Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC.
- You can use Athena to run ad hoc queries using ANSI SQL, without aggregating the data or loading it into Athena.
- Athena integrates with Amazon QuickSight for easy data visualization.
- You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver.
- Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3.
- This allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with AWS Glue's ETL and data discovery features.
- For more information, see “Integration with AWS Glue” and “What is AWS Glue” in the AWS Glue Developer Guide.
- Amazon Athena makes it easy to run interactive queries against data directly in Amazon S3 without having to format data or manage infrastructure.
- For example, Athena is helpful if you want to run a quick query on web logs to troubleshoot a performance issue on your site.
- With Athena, you can get started fast: you define a table for your data and start querying using standard SQL.
- You should use Amazon Athena to run interactive ad hoc SQL queries against data on Amazon S3, without managing any infrastructure or clusters.
- Amazon Athena provides the easiest way to run ad hoc queries for data in Amazon S3 without setting up or managing servers.
- For a list of AWS services that Athena leverages or integrates with, see “AWS service integrations with Athena”.
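The "define a table, then query it with standard SQL" workflow from the notes above can be sketched with boto3's Athena client. This is a minimal sketch under stated assumptions: the table name, column layout, and bucket paths are hypothetical, and the helper names `create_table_ddl` and `run_query` are illustrative rather than part of any AWS API.

```python
import time


def create_table_ddl(table, location):
    """Build a CREATE EXTERNAL TABLE statement for CSV web logs in S3.

    The column names and CSV layout are hypothetical placeholders.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  request_time string,\n"
        "  url string,\n"
        "  status int,\n"
        "  latency_ms int\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


def run_query(client, sql, output_location, database="default", poll_seconds=1.0):
    """Submit a statement, wait for it to finish, and return the result rows.

    `client` is expected to behave like boto3.client("athena").
    Only the first page of results is returned.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")
    return client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

A typical use would first run `create_table_ddl("web_logs", "s3://example-bucket/logs/")` through `run_query`, then run something like `SELECT url, avg(latency_ms) FROM web_logs GROUP BY url` to look into the slow-site scenario mentioned earlier. The data stays in S3 throughout; only the table definition is registered.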
Amazon EMR
- Amazon EMR makes it simple and cost-effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto when compared to on-premises deployments. Amazon EMR is flexible – you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements.
- In addition to running SQL queries, Amazon EMR can run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, streaming data, and virtually anything you can code.
- You should use Amazon EMR if you use custom code to process and analyze very large datasets with the latest big data processing frameworks such as Spark, Hadoop, Presto, or HBase. Amazon EMR gives you complete control over the configuration of your clusters and the software installed on them.
- You can use Amazon Athena to query data you process using Amazon EMR.
- Amazon Athena supports many of the same data formats as Amazon EMR.
- Athena's data catalog is “Hive metastore” compatible.
- If you use EMR and already have a “Hive metastore”, you can run your DDL statements on Amazon Athena and query your data immediately without affecting your Amazon EMR jobs.
Amazon Redshift
- A data warehouse like Amazon Redshift is your best choice when you need to pull together data from many different sources – like inventory systems, financial systems, and retail sales systems – into a standard format and store it for long periods.
- If you want to build sophisticated business reports from historical data, a data warehouse like Amazon Redshift is the best choice.
- The query engine in Amazon Redshift is optimized for complex queries that join large numbers of very large database tables.
- When you need to run queries against highly structured data with lots of joins across large tables, choose Amazon Redshift.