• Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
  • It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
  • You can use Amazon Redshift Spectrum to query data in Amazon S3 files without having to load the data into Amazon Redshift tables.
  • Amazon Redshift provides SQL capability designed for fast online analytical processing (OLAP) of very large datasets that are stored in both Amazon Redshift clusters and Amazon S3 data lakes.
  • You can query data in many formats, including Parquet, ORC, RCFile, TextFile, SequenceFile, RegexSerde, OpenCSV, and AVRO.
  • To define the structure of the files in Amazon S3, you create external schemas and tables.
    • You register these external schemas and tables in an external data catalog, such as the AWS Glue Data Catalog or your own Apache Hive metastore.
    • Changes to either type of data catalog are immediately available to any of your Amazon Redshift clusters.
    • After your data is registered with an AWS Glue Data Catalog and enabled with AWS Lake Formation, you can query it by using Redshift Spectrum.
  • Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster.
    • Amazon Redshift pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer.
    • Redshift Spectrum also scales with the demands of your queries to take advantage of massively parallel processing.
  • You can partition the external tables on one or more columns to optimize query performance through partition elimination.
    • You can query and join the external tables with Amazon Redshift tables.
    • You can access external tables from multiple Amazon Redshift clusters and query the Amazon S3 data from any cluster in the same AWS Region.
    • When you update Amazon S3 data files, the data is immediately available for queries from any of your Amazon Redshift clusters.
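
The partition elimination mentioned above can be illustrated with a small conceptual model. The sketch below is not how Redshift Spectrum is implemented; it only shows the idea that a filter on the partition column lets the engine skip whole S3 prefixes. The `Partition` class, the `prune_partitions` helper, the `saledate` column, and the bucket name are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Partition:
    """One partition of a hypothetical external table, keyed by sale date."""
    saledate: date
    s3_prefix: str  # hypothetical S3 location holding this partition's files


def prune_partitions(partitions, start, end):
    """Keep only partitions whose key falls within [start, end], inclusive."""
    return [p for p in partitions if start <= p.saledate <= end]


# Hypothetical layout: one S3 prefix per month of 2008.
partitions = [
    Partition(date(2008, m, 1), f"s3://example-bucket/sales/saledate=2008-{m:02d}/")
    for m in range(1, 13)
]

# A filter on the partition column (February through March) means only the
# matching prefixes need to be read; the other ten are never touched.
to_scan = prune_partitions(partitions, date(2008, 2, 1), date(2008, 3, 31))
for p in to_scan:
    print(p.s3_prefix)
print(f"scanned {len(to_scan)} of {len(partitions)} partitions")
```

In this toy layout, a query restricted to February and March reads two prefixes instead of twelve, which is why a filter on the partition column tends to reduce the amount of Amazon S3 data scanned.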