- Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
- It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
- You can use Amazon Redshift Spectrum to query data in Amazon S3 files without having to load the data into Amazon Redshift tables.
- Amazon Redshift provides SQL capability designed for fast online analytical processing (OLAP) of very large datasets that are stored in both Amazon Redshift clusters and Amazon S3 data lakes.
- You can query data in many formats, including Parquet, ORC, RCFile, TextFile, SequenceFile, RegexSerde, OpenCSV, and AVRO.
- To define the structure of the files in Amazon S3, you create external schemas and tables.
- The external schemas and tables are backed by an external data catalog, such as AWS Glue or your own Apache Hive metastore.
- Changes to either type of data catalog are immediately available to any of your Amazon Redshift clusters.
- After your data is registered with an AWS Glue Data Catalog and enabled with AWS Lake Formation, you can query it by using Redshift Spectrum.
- Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster.
- Amazon Redshift pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so Spectrum queries use less of your cluster's processing capacity.
- Redshift Spectrum also scales with the demands of your queries to take advantage of massively parallel processing.
- You can partition the external tables on one or more columns to optimize query performance through partition elimination.
- You can query and join the external tables with Amazon Redshift tables.
- You can access external tables from multiple Amazon Redshift clusters and query the Amazon S3 data from any cluster in the same AWS Region.
- When you update Amazon S3 data files, the data is immediately available for queries from any of your Amazon Redshift clusters.
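
The external schema, partitioned external table, and join described above can be sketched as a small Python script that only builds the SQL text. This is a minimal illustration, not a definitive setup: the schema, catalog database, IAM role ARN, bucket, table, and column names are all hypothetical placeholders, and the script does not connect to a cluster. The statements are assembled with plain string formatting, so it is only suitable for trusted, hard-coded values.

```python
# Sketch: generate the SQL for a partitioned Redshift Spectrum external table.
# Every identifier below (schema, database, role ARN, bucket, columns) is a
# hypothetical placeholder chosen for illustration.

SCHEMA = "spectrum_demo"
CATALOG_DB = "demo_catalog_db"
IAM_ROLE = "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
S3_PREFIX = "s3://example-bucket/sales/"


def create_external_schema(schema, catalog_db, iam_role):
    """Point an external schema at a database in the AWS Glue Data Catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role}';"
    )


def create_external_table(schema, table, columns, partitions, location):
    """Describe Parquet files in S3; partition columns come from the S3 path."""
    cols = ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns.items())
    parts = ", ".join(f"{name} {sqltype}" for name, sqltype in partitions.items())
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n{cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


def add_partition(schema, table, column, value, location):
    """Register one partition so that queries on other values can skip it."""
    return (
        f"ALTER TABLE {schema}.{table}\n"
        f"ADD IF NOT EXISTS PARTITION ({column}='{value}')\n"
        f"LOCATION '{location}{column}={value}/';"
    )


def join_query(schema, table, local_table, sale_date):
    """Join the external table with a local table, filtering on the partition."""
    return (
        f"SELECT c.region, SUM(s.amount) AS total_amount\n"
        f"FROM {schema}.{table} AS s\n"
        f"JOIN {local_table} AS c ON c.customer_id = s.customer_id\n"
        f"WHERE s.saledate = '{sale_date}'\n"
        f"GROUP BY c.region;"
    )


def build_statements():
    """Return the four statements in the order they would be run."""
    columns = {
        "sale_id": "integer",
        "customer_id": "integer",
        "amount": "decimal(10,2)",
    }
    return [
        create_external_schema(SCHEMA, CATALOG_DB, IAM_ROLE),
        create_external_table(SCHEMA, "sales", columns, {"saledate": "date"}, S3_PREFIX),
        add_partition(SCHEMA, "sales", "saledate", "2024-01-01", S3_PREFIX),
        join_query(SCHEMA, "sales", "public.customers", "2024-01-01"),
    ]


if __name__ == "__main__":
    print("\n\n".join(build_statements()))
```

Filtering on `saledate` in the final query is what allows partition elimination: only the S3 prefix registered for that date needs to be scanned. The join against `public.customers` (a hypothetical local table) shows an external table and a regular Amazon Redshift table used in the same query.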