Amazon Web Services Blog

AWS Compute Optimizer – Your Customized Resource Optimization Service

When I publicly speak about Amazon EC2 instance type, one frequently asked question I receive is “How can I be sure I chose the right instance type for my application?” Choosing the correct instance type is between art and science. It usually involves knowing your application performance characteristics under normal circumstances (the baseline) and the expected daily variations, and to pick up an instance type that matches these characteristics. After that, you monitor key metrics to validate your choice, and you iterate over time to adjust the instance type that best suits the cost vs performance ratio for your application. Over-provisioning resources results in paying too much for your infrastructure, and under-provisioning resources results in lower application performance, possibly impacting customer experience. Earlier this year, we launched Cost Explorer Rightsizing Recommendations, which helps you identify under-utilized Amazon Elastic Compute Cloud (EC2) instances that may be downsized within the same family to save money. We received great feedback and customers are asking for more recommendations beyond just downsizing within the same instance family. Today, we are announcing a new service to help you to optimize compute resources for your workloads: AWS Compute Optimizer. AWS Compute Optimizer uses machine learning techniques to analyze the history of resource consumption on your account, and make well-articulated and actionable recommendations tailored to your resource usage. AWS Compute Optimizer is integrated to AWS Organizations, you can view recommendations for multiple accounts from your master AWS Organizations account. To get started with AWS Compute Optimizer, I navigate to the AWS Management Console, select AWS Compute Optimizer, and activate the service. It immediately starts to analyze my resource usage and history using Amazon CloudWatch metrics and delivers the first recommendations a few hours later. I can see the first recommendations on the AWS Compute Optimizer dashboard: I click Over-provisioned: 8 instances to get the details: I click on one of the eight links to get the actionable findings: AWS Compute Optimizer offers multiple options. I scroll down the bottom of that page to verify what is the impact if I decide to apply this recommendation: I can also access the recommendation from the AWS Command Line Interface (CLI): $ aws compute-optimizer get-ec2-instance-recommendations --instance-arns arn:aws:ec2:us-east-1:012345678912:instance/i-0218a45abd8b53658 { "instanceRecommendations": [ { "instanceArn": "arn:aws:ec2:us-east-1:012345678912:instance/i-0218a45abd8b53658", "accountId": "012345678912", "currentInstanceType": "m5.xlarge", "finding": "OVER_PROVISIONED", "utilizationMetrics": [ { "name": "CPU", "statistic": "MAXIMUM", "value": 2.0 } ], "lookBackPeriodInDays": 14.0, "recommendationOptions": [ { "instanceType": "r5.large", "projectedUtilizationMetrics": [ { "name": "CPU", "statistic": "MAXIMUM", "value": 3.2 } ], "performanceRisk": 1.0, "rank": 1 }, { "instanceType": "t3.xlarge", "projectedUtilizationMetrics": [ { "name": "CPU", "statistic": "MAXIMUM", "value": 2.0 } ], "performanceRisk": 3.0, "rank": 2 }, { "instanceType": "m5.xlarge", "projectedUtilizationMetrics": [ { "name": "CPU", "statistic": "MAXIMUM", "value": 2.0 } ], "performanceRisk": 1.0, "rank": 3 } ], "recommendationSources": [ { "recommendationSourceArn": "arn:aws:ec2:us-east-1:012345678912:instance/i-0218a45abd8b53658", "recommendationSourceType": "Ec2Instance" } ], "lastRefreshTimestamp": 1575006953.102 } ], "errors": [] } Keep in mind that AWS Compute Optimizer uses Amazon CloudWatch metrics as basis for the recommendations. By default, CloudWatch metrics are the ones it can observe from an hypervisor point of view, such as CPU utilization, disk IO, and network IO. If I want AWS Compute Optimizer to take into account operating system level metrics, such as memory usage, I need to install a CloudWatch agent on my EC2 instance. AWS Compute Optimizer automatically recognizes these metrics when available and takes these into account when creating recommendation, otherwise, it shows “Data Unavailable” in the console. AWS customers told us performance is not the only metric they look at when choosing a resource, the price vs performance ratio is important too. For example, it might make sense to use a new generation instance family, such as m5, rather than the older generation (m3 or m4), even when the new generation seems over-provisioned for the workload. This is why, after AWS Compute Optimizer identifies a list of optimal AWS resources for your workload, it presents on-demand pricing, reserved instance pricing, reserved instance utilization, and reserved instance coverage, along with expected resource efficiency to its recommendations. AWS Compute Optimizer makes it easy to right-size your resource. However, keep in mind that while it is relatively easy to right-size resources for modern applications, or stateless applications that scale horizontally, it might be very difficult to right-size older apps. Some older apps might not run correctly under different hardware architecture, or need different drivers, or not be supported by the application vendor at all. Be sure to check with your vendor before trying to optimize cloud resources for packaged or older apps. We strongly advise you to thoroughly test your applications on the new recommended instance type before applying any recommendations into production. Compute Optimizer is free to use and available initially in these AWS Regions: US East (N. Virginia), US West (Oregon), Europe (Ireland), US East (Ohio), South America (São Paulo). Connect to the AWS Management Console today and discover how much you can save by choosing the right resource size for your cloud applications. -- seb

New for AWS Transit Gateway – Build Global Networks and Centralize Monitoring Using Network Manager

As your company grows and gets the benefits of a cloud-based infrastructure, your on-premises sites like offices and stores increasingly need high performance private connectivity to AWS and to other sites at a reasonable cost. Growing your network is hard, because traditional branch networks based on leased lines are costly, and they suffer from the same lack of elasticity and agility as traditional data centers. At the same time, it becomes increasingly complex to manage and monitor a global network that is spread across AWS Regions and on-premises sites. You need to stitch together data from these diverse locations. This results in an inconsistent operational experience, increased costs and efforts, and missed insights from the lack of visibility across different technologies. Today, we want to make it easier to build, manage, and monitor global networks with the following new capabilities for AWS Transit Gateway: Transit Gateway inter-region peering Accelerated site-to-site VPN AWS Transit Gateway Network Manager These new networking capabilities enable you to optimize your network using AWS’s global backbone, and to centrally visualize and monitor your global network. More specifically: Inter-region peering and accelerated VPN improve application performance by leveraging the AWS Global Network. In this way, you can reduce the number of leased-lines required to operate your network, optimizing your cost and improving agility. Transit Gateway inter-region peering sends inter region traffic privately over AWS’s global network backbone. Accelerated VPN uses AWS Global Accelerator to route VPN traffic from remote locations through the closest AWS edge location to improve connection performance. Network Manager reduces the operational complexity of managing a global network across AWS and on-premises. With Network Manager, you set up a global view of your private network simply by registering your Transit Gateways and on-premises resources. Your global network can then be visualized and monitored via a centralized operational dashboard. These features allow you to optimize connectivity from on-premises sites to AWS and also between on-premises sites, by routing traffic through Transit Gateways and the AWS Global Network, and centrally managing through Network Manager. Visualizing Your Global Network In the Network Manager console, that you can reach from the Transit Gateways section of the Amazon Virtual Private Cloud console, you have an overview of your global networks. Each global network includes AWS and on-premises resources. Specifically, it provides a central point of management for your AWS Transit Gateways, your physical devices and sites connected to the Transit Gateways via Site-to-Site VPN Connections, and AWS Direct Connect locations attached to the Transit Gateways. For example, this is the Geographic view of a global network covering North America and Europe with 5 Transit Gateways in 3 AWS Regions, 80 VPCs, 50 VPNs, 1 Direct Connect location, and 16 on-premises sites with 50 devices: As I zoom in the map, I get a description on what these nodes represent, for example if they are AWS Regions, Direct Connect locations, or branch offices. I can select any node in the map to get more information. For example, I select the US West (Oregon) AWS Region to see the details of the two Transit Gateways I am using there, including the state of all VPN connections, VPCs, and VPNs handled by the selected Transit Gateway. Selecting a site, I get a centralized view with the status of the VPN connections, including site metadata such as address, location, and description. For example, here are the details of the Colorado branch offices. In the Topology panel, I see the logical relationship of all the resources in my network. On the left here, there is the entire topology of my global network, on the right the detail of the European part. Connections statuses are reported in color in the topology view. Selecting any node in the topology map displays details specific to the resource type (Transit Gateway, VPC, customer gateway, and so on) including links to the corresponding service in the AWS console to get more information and configure the resource. Monitoring Your Global Network Network Manager is using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics for data in/out, packets dropped, and VPN connection status. These statistics are kept for 15 months, so that you can access historical information and gain a better perspective on how your web application or service is performing. You can also set alarms that watch for certain thresholds, and send notifications or take actions when those thresholds are met. For example, these are the last 12 hours of Monitoring for the Transit Gateway in Europe (Ireland). In the global network view, you have a single point of view of all events affecting your network, simplifying root cause analysis in case of issues. Clicking on any of the messages in the console will take to a more detailed view in the Events tab. Your global network events are also delivered by CloudWatch Events. Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams. To process the same events, you can also use the additional capabilities offered by Amazon EventBridge. Network Manager sends the following types of events: Topology changes, for example when a VPN connection is created for a transit gateway. Routing updates, such as when a route is deleted in a transit gateway route table. Status updates, for example in case a VPN tunnel’s BGP session goes down. Configuring Your Global Network To get your on-premises resources included in the above visualizations and monitoring, you need to input into Network Manager information about your on-premises devices, sites, and links. You also need to associate devices with the customer gateways they host for VPN connections. Our software-defined wide area network (SD-WAN) partners, such as Cisco, Aruba, Silver Peak, and Aviatrix, have configured their SD-WAN devices to connect with Transit Gateway Network Manager in only a few clicks. Their SD-WANs also define the on-premises devices, sites, and links automatically in Network Manager. SD-WAN integrations enable to include your on-premises network in the Network Manager global dashboard view without requiring you to input information manually. Available Now AWS Transit Gateway Network Manager is a global service available for Transit Gateways in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Europe (London), Europe (Paris), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Mumbai), Canada (Central), South America (São Paulo). There is no additional cost for using Network Manager. You pay for the network resources you use, like Transit Gateways, VPNs, and so on. Here you can find more information on pricing for VPN and Transit Gateway. You can learn more in the documentation of the Network Manager, inter-region peering, and accelerated VPN. With these new features, you can take advantage of the performance of our AWS Global Network, and simplify network management and monitoring across your AWS and on-premises resources. — Danilo

New – VPC Ingress Routing – Simplifying Integration of Third-Party Appliances

When I was delivering the Architecting on AWS class, customers often asked me how to configure an Amazon Virtual Private Cloud to enforce the same network security policies in the cloud as they have on-premises. For example, to scan all ingress traffic with an Intrusion Detection System (IDS) appliance or to use the same firewall in the cloud as on-premises. Until today, the only answer I could provide was to route all traffic back from their VPC to an on-premises appliance or firewall in order to inspect the traffic with their usual networking gear before routing it back to the cloud. This is obviously not an ideal configuration, it adds latency and complexity. Today, we announce new VPC networking routing primitives to allow to route all incoming and outgoing traffic to/from an Internet Gateway (IGW) or Virtual Private Gateway (VGW) to a specific EC2 instance’s Elastic Network Interface. It means you can now configure your Virtual Private Cloud to send all traffic to an EC2 instance before the traffic reaches your business workloads. The instance typically runs network security tools to inspect or to block suspicious network traffic (such as IDS/IPS or Firewall) or to perform any other network traffic inspection before relaying the traffic to other EC2 instances. How Does it Work? To learn how it works, I wrote this CDK script to create a VPC with two public subnets: one subnet for the appliance and one subnet for a business application. The script launches two EC2 instances with public IP address, one in each subnet. The script creates the below architecture: This is a regular VPC, the subnets have routing tables to the Internet Gateway and the traffic flows in and out as expected. The application instance hosts a static web site, it is accessible from any browser. You can retrieve the application public DNS name from the EC2 Console (for your convenience, I also included the CLI version in the comments of the CDK script). AWS_REGION=us-west-2 APPLICATION_IP=$(aws ec2 describe-instances \ --region $AWS_REGION \ --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].Association.PublicDnsName" \ --output text) curl -I $APPLICATION_IP Configure Routing To configure routing, you need to know the VPC ID, the ENI ID of the ENI attached to the appliance instance, and the Internet Gateway ID. Assuming you created the infrastructure using the CDK script I provided, here are the commands I use to find these three IDs (be sure to adjust to the AWS region you use): AWS_REGION=us-west-2 VPC_ID=$(aws cloudformation describe-stacks \ --region $AWS_REGION \ --stack-name VpcIngressRoutingStack \ --query "Stacks[].Outputs[?OutputKey=='VPCID'].OutputValue" \ --output text) ENI_ID=$(aws ec2 describe-instances \ --region $AWS_REGION \ --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='appliance']].NetworkInterfaces[].NetworkInterfaceId" \ --output text) IGW_ID=$(aws ec2 describe-internet-gateways \ --region $AWS_REGION \ --query "InternetGateways[] | [?Attachments[?VpcId=='${VPC_ID}']].InternetGatewayId" \ --output text) To route all incoming traffic through my appliance, I create a routing table for the Internet Gateway and I attach a rule to direct all traffic to the EC2 instance Elastic Network Interface (ENI): # create a new routing table for the Internet Gateway ROUTE_TABLE_ID=$(aws ec2 create-route-table \ --region $AWS_REGION \ --vpc-id $VPC_ID \ --query "RouteTable.RouteTableId" \ --output text) # create a route for pointing to the appliance ENI aws ec2 create-route \ --region $AWS_REGION \ --route-table-id $ROUTE_TABLE_ID \ --destination-cidr-block \ --network-interface-id $ENI_ID # associate the routing table to the Internet Gateway aws ec2 associate-route-table \ --region $AWS_REGION \ --route-table-id $ROUTE_TABLE_ID \ --gateway-id $IGW_ID Alternatively, I can use the VPC Console under the new Edge Associations tab. To route all application outgoing traffic through the appliance, I replace the default route for the application subnet to point to the appliance’s ENI: SUBNET_ID=$(aws ec2 describe-instances \ --region $AWS_REGION \ --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].SubnetId" \ --output text) ROUTING_TABLE=$(aws ec2 describe-route-tables \ --region $AWS_REGION \ --query "RouteTables[?VpcId=='${VPC_ID}'] | [?Associations[?SubnetId=='${SUBNET_ID}']].RouteTableId" \ --output text) # delete the existing default route (the one pointing to the internet gateway) aws ec2 delete-route \ --region $AWS_REGION \ --route-table-id $ROUTING_TABLE \ --destination-cidr-block # create a default route pointing to the appliance's ENI aws ec2 create-route \ --region $AWS_REGION \ --route-table-id $ROUTING_TABLE \ --destination-cidr-block \ --network-interface-id $ENI_ID aws ec2 associate-route-table \ --region $AWS_REGION \ --route-table-id $ROUTING_TABLE \ --subnet-id $SUBNET_ID Alternatively, I can use the VPC Console. Within the correct routing table, I select the Routes tab and click Edit routes to replace the default route (the one pointing to to target the appliance’s ENI. Now I have the routing configuration in place. The new routing looks like: Configure the Appliance Instance Finally, I configure the appliance instance to forward all traffic it receives. Your software appliance usually does that for you, no extra step is required when you use AWS Marketplace appliances. When using a plain Linux instance, two extra steps are required: 1. Connect to the EC2 appliance instance and configure IP traffic forwarding in the kernel: APPLIANCE_ID=$(aws ec2 describe-instances \ --region $AWS_REGION \ --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='appliance']].InstanceId" \ --output text) aws ssm start-session --region $AWS_REGION --target $APPLIANCE_ID ## ## once connected (you see the 'sh-4.2$' prompt), type: ## sudo sysctl -w net.ipv4.ip_forward=1 sudo sysctl -w net.ipv6.conf.all.forwarding=1 exit 2. Configure the EC2 instance to accept traffic for different destinations than itself (known as Dest/Source check) : aws ec2 modify-instance-attribute --region $AWS_REGION \ --no-source-dest-check \ --instance-id $APPLIANCE_ID Now, the appliance is ready to forward traffic to the other EC2 instances. You can test this by pointing your browser (or using `cURL`) to the application instance. APPLICATION_IP=$(aws ec2 describe-instances --region $AWS_REGION \ --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].Association.PublicDnsName" \ --output text) curl -I $APPLICATION_IP To verify the traffic is really flowing through the appliance, you can enable source/destination check on the instance again (use --source-dest-check parameter with the modify-instance-attributeCLI command above). The traffic is blocked when Source/Destination check is enabled. Cleanup Should you use the CDK script I provided for this article, be sure to run cdk destroy when finished. This ensures you are not billed for the two EC2 instances I use for this demo. As I modified routing tables behind the back of AWS CloudFormation, I need to manually delete the routing tables, the subnet, and the VPC. The easiest is to navigate to the VPC Console, select the VPC and click Actions => Delete VPC. The console deletes all components in the correct order. You might need to wait 5-10 minutes after the end of cdk destroy before the console is able to delete the VPC. Availability There are no additional costs to use Virtual Private Cloud ingress routing. It is available in all AWS Regions (including AWS GovCloud (US-West)) and you can start to use it today. You can learn more about gateway routing tables in the updated VPC documentation. What are the appliances you are going to use with this new VPC routing capability? -- seb

Amazon EC2 Update – Inf1 Instances with AWS Inferentia Chips for High Performance Cost-Effective Inferencing

Our customers are taking to Machine Learning in a big way. They are running many different types of workloads, including object detection, speech recognition, natural language processing, personalization, and fraud detection. When running on large-scale production workloads, it is essential that they can perform inferencing as quickly and as cost-effectively as possible. According to what they have told us, inferencing can account for up to 90% of the cost of their machine learning work. New Inf1 Instances Today we are launching Inf1 instances in four sizes. These instances are powered by AWS Inferentia chips, and are designed to provide you with fast, low-latency inferencing. AWS Inferentia chips are designed to accelerate the inferencing process. Each chip can deliver the following performance: 64 teraOPS on 16-bit floating point (FP16 and BF16) and mixed-precision data. 128 teraOPS on 8-bit integer (INT8) data. The chips also include a high-speed interconnect, and lots of memory. With 16 chips on the largest instance, your new and existing TensorFlow, PyTorch, and MxNet inferencing workloads can benefit from over 2 petaOPS of inferencing power. When compared to the G4 instances, the Inf1 instances offer up to 3x the inferencing throughput, and up to 40% lower cost per inference. Here are the sizes and specs: Instance Name Inferentia Chips vCPUs RAM EBS Bandwidth Network Bandwidth inf1.xlarge 1 4 8 GiB Up to 3.5 Gbps Up to 25 Gbps inf1.2xlarge 1 8 16 GiB Up to 3.5 Gbps Up to 25 Gbps inf1.6xlarge 4 24 48 GiB 3.5 Gbps 25 Gbps inf1.24xlarge 16 96 192 GiB 14 Gbps 100 Gbps The instances make use of custom Second Generation Intel® Xeon® Scalable (Cascade Lake) processors, and are available in On-Demand, Spot, and Reserved Instance form, or as part of a Savings Plan in the US East (N. Virginia) and US West (Oregon) Regions. You can launch the instances directly, and they will also be available soon through Amazon SageMaker and Amazon ECS, and Amazon Elastic Kubernetes Service. Using Inf1 Instances Amazon Deep Learning AMIs have been updated and contain versions of TensorFlow and MxNet that have been optimized for use in Inf1 instances, with PyTorch coming very soon. The AMIs contain the new AWS Neuron SDK, which contains commands to compile, optimize, and execute your ML models on the Inferentia chip. You can also include the SDK in your own AMIs and images. You can build and train your model on a GPU instance such as a P3 or P3dn, and then move it to an Inf1 instance for production use. You can use a model natively trained in FP16, or you can use models that have been trained to 32 bits of precision and have AWS Neuron automatically convert them to BF16 form. Large models, such as those for language translation or natural language processing, can be split across multiple Inferentia chips in order to reduce latency. The AWS Neuron SDK also allows you to assign models to Neuron Compute Groups, and to run them in parallel. This allows you to maximize hardware utilization and to use multiple models as part of Neuron Core Pipeline mode, taking advantage of the large on-chip cache on each Inferentia chip. Be sure to read the AWS Neuron SDK Tutorials to learn more! — Jeff;  

AWS Outposts Now Available – Order Yours Today!

We first discussed AWS Outposts at re:Invent 2018. Today, I am happy to announce that we are ready to take orders and install Outposts racks in your data center or colo facility. Why Outposts? This new and unique AWS offering is a comprehensive, single-vendor compute & storage solution that is designed to meet the needs of customers who need local processing and very low latency. You no longer need to spend time creating detailed hardware specifications, soliciting & managing bids from multiple disparate vendors, or racking & stacking individual servers. Instead, you place your order online, take delivery, and relax while trained AWS technicians install, connect, set up, and verify your Outposts. Once installed, we take care of monitoring, maintaining, and upgrading your Outposts. All of the hardware is modular and can be replaced in the field without downtime. When you need more processing or storage, or want to upgrade to newer generations of EC2 instances, you can initiate the request with a couple of clicks and we will take care of the rest. Everything that you and your team already know about AWS still applies. You use the same APIs, tools, and operational practices. You can create a single deployment pipeline that target your Outposts and your cloud-based environments, and you can create hybrid architectures that span both. Each Outpost is connected to and controlled by a specific AWS Region. The region treats a collection of up to 16 racks at a single location as a unified capacity pool. The collection can be associated with subnets of one or more VPCs in the parent region. Outposts Hardware The Outposts hardware is the same as what we use in our own data centers, with some additional security devices. The hardware is designed for reliability & efficiency, with redundant network switches and power supplies, and DC power distribution. Outpost racks are 80″ tall, 24″ wide, 48″ deep, and can weigh up to 2000 lbs. They arrive fully assembled, and roll in on casters, ready for connection to power and networking. To learn more about the Outposts hardware, watch my colleague Anthony Liguori explain it: Outposts supports multiple Intel®-powered Nitro-based EC2 instance types including C5, C5d, M5, M5d, R5, R5d, G4, and I3en. You can choose the mix of types that is right for your environment, and you can add more later. You will also be able to upgrade to newer instance types as they become available. On the storage side, Outposts support EBS gp2 (general purpose SSD) storage, with a minimum size of 2.7 TB. Outpost Networking Each Outpost has a pair of networking devices, each with 400 Gbps of connectivity and support for 1 GigE, 10 GigE, 40 GigE, and 100 Gigabit fiber connections. The connections are used to host a pair of Link Aggregation Groups, one for the link to the parent region, and another to your local network. The link to the parent region is used for control and VPC traffic; all connections originate from the Outpost. Traffic to and from your local network flows through a Local Gateway (LGW), giving you full control over access and routing. Here’s an overview of the networking topology within your premises: You will need to allocate a /26 CIDR block to each Outpost, which is advertised as a pair of /27 blocks in order to protect against device and link failures. The CIDR block can be within your own range of public IP addresses, or it can be an RFC 1918 private address plus NAT at your network edge. Outposts are simply new subnets on an existing VPC in the parent region. Here’s how to create one: $ aws ec2 create-subnet --vpc-id VVVVVV \ --cidr-block A.B.C.D/24 \ --outpost-arn arn:aws:outposts:REGION:ACCOUNT_ID:outpost:OUTPOST_ID If you have Cisco or Juniper hardware in your data center, the following guides will be helpful: Cisco – Outposts Solution Overview. To learn more about the partnership between AWS and Cisco, visit this page. Juniper – AWS Outposts in a Juniper QFX-Based Datacenter. In most cases you will want to use AWS Direct Connect to establish a connection between your Outposts and the parent AWS Region. For more information on this and to learn a lot more about how to plan your Outposts network model, consult the How it Works documentation. Outpost Services We are launching with support for Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), Amazon Virtual Private Cloud, Amazon ECS, Amazon Elastic Kubernetes Service, and Amazon EMR, with additional services in the works. Amazon RDS for PostgreSQL and Amazon RDS for MySQL are available in preview form. Your applications can also make use of any desired services in the parent region, including Amazon Simple Storage Service (S3), Amazon DynamoDB, Auto Scaling, AWS CloudFormation, Amazon CloudWatch, AWS CloudTrail, AWS Config, Load Balancing, and so forth. You can create and use Interface Endpoints from within the VPC, or you can access the services through the regional public endpoints. Services & applications in the parent region that launch, manage, or refer to EC2 instances or EBS volumes can operate on those objects within an Outpost with no changes. Purchasing an Outpost The process of purchasing an Outpost is a bit more involved than that of launching an EC2 instance or creating an S3 bucket, but it should be straightforward. I don’t actually have a data center, and won’t actually take delivery of an Outpost, but I’ll do my best to show you the actual experience! The first step is to describe and qualify my site. I enter my address: I confirm temperature, humidity, and airflow at the rack position, that my loading dock can accommodate the shipping crate, and that there’s a clear access path from the loading dock to the rack’s final resting position: I provide information about my site’s power configuration: And the networking configuration: After I create the site, I create my Outpost: Now I am ready to order my hardware. I can choose any one of 18 standard configurations, with varied amounts of compute capacity and storage (custom configurations are also available), and click Create order to proceed: The EC2 capacity shown above indicates the largest instance size of a particular type. I can launch instances of that size, or I can use the smaller sizes, as needed. For example, the the capacity of the OR-HUZEI16 configuration that I selected is listed as 7 m5.24xlarge instances and 3 c5.24xlarge instances. I could launch a total of 10 instances in those sizes, or (if I needed lots of smaller ones) I could launch 168 m5.xlarge instances and 72 c5.xlarge instances. I could also use a variety of sizes, subject to available capacity and the details of how the instances are assigned to the hardware. I confirm my order, choose the Outpost that I created earlier, and click Submit order: My order will be reviewed, my colleagues might give me a call to review some details, and my Outpost will be shipped to my site. A team of AWS installers will arrive to unpack & inspect the Outpost, transport it to its resting position in my data center, and work with my data center operations (DCO) team to get it connected and powered up. Once the Outpost is powered up and the network is configured, it will set itself up automatically. At that point I can return to the console and monitor capacity exceptions (situations where demand exceeds supply), capacity availability, and capacity utilization: Using an Outpost The next step is to set up one or more subnets in my Outpost, as shown above. Then I can launch EC2 instances and create EBS volumes in the subnet, just as I would with any other VPC subnet. I can ask for more capacity by selecting Increase capacity from the Actions menu: The AWS team will contact me within 3 business days to discuss my options. Things to Know Here are a couple of other things to keep in mind when thinking about using Outposts: Availability – Outposts are available in the following countries: North America (United States) Europe (All EU countries, Switzerland, Norway) Asia Pacific (Japan, South Korea, Australia) Support – You must subscribe to AWS Enterprise Support in order to purchase an Outpost. We will remotely monitor your Outpost, and keep it happy & healthy over time. We’ll look for failing components and arrange to replace them without disturbing your operations. Billing & Payment Options – You can purchase Outposts on a three-year term, with All Upfront, Partial Upfront, and No Upfront payment options. The purchase price covers all EC2 and EBS usage within the Outpost; other services are billed by the hour, with the EC2 and EBS portions removed. You pay the regular inter-AZ data transfer charge to move data between an Outpost and another subnet in the same VPC, and the usual AWS data transfer charge for data that exits to the Internet across the link to the parent region. Capacity Expansion – Today, you can group up to 16 racks into a single capacity pool. Over time we expect to allow you to group thousands of racks together in this manner. Stay Tuned This is, like most AWS announcements, just the starting point. We have a lot of cool stuff in the works, and it is still Day One for AWS Outposts! — Jeff;  

AWS Now Available from a Local Zone in Los Angeles

AWS customers are always asking for more features, more bandwidth, more compute power, and more memory, while also asking for lower latency and lower prices. We do our best to meet these competing demands: we launch new EC2 instance types, EBS volume types, and S3 storage classes at a rapid pace, and we also reduce prices regularly. AWS in Los Angeles Today we are launching a Local Zone in Los Angeles, California. The Local Zone is a new type of AWS infrastructure deployment that brings select AWS services very close to a particular geographic area. This Local Zone is designed to provide very low latency (single-digit milliseconds) to applications that are accessed from Los Angeles and other locations in Southern California. It will be of particular interest to highly-demanding applications that are particularly sensitive to latency. This includes: Media & Entertainment – Gaming, 3D modeling & rendering, video processing (including real-time color correction), video streaming, and media production pipelines. Electronic Design Automation – Interactive design & layout, simulation, and verification. Ad-Tech – Rapid decision making & ad serving. Machine Learning – Fast, continuous model training; high-performance low-latency inferencing. All About Local Zones The new Local Zone in Los Angeles is a logical part of the US West (Oregon) Region (which I will refer to as the parent region), and has some unique and interesting characteristics: Naming – The Local Zone can be accessed programmatically as us-west-2-lax-1a. All API, CLI, and Console access takes place through the us-west-2 API endpoint and the US West (Oregon) Console. Opt-In – You will need to opt in to the Local Zone in order to use it. After opting in, you can create a new VPC subnet in the Local Zone, taking advantage of all relevant VPC features including Security Groups, Network ACLs, and Route Tables. You can target the Local Zone when you launch EC2 instances and other resources, or you can create a default subnet in the VPC and have it happen automatically. Networking – The Local Zone in Los Angeles is connected to US West (Oregon) over Amazon’s private backbone network. Connections to the public internet take place across an Internet Gateway, giving you local ingress and egress to reduce latency. Elastic IP Addresses can be shared by a group of Local Zones in a particular geographic location, but they do not move between a Local Zone and the parent region. The Local Zone also supports AWS Direct Connect, giving you the opportunity to route your traffic over a private network connection. Services – We are launching with support for seven EC2 instance types (T3, C5, M5, R5, R5d, I3en, and G4), two EBS volume types (io1 and gp2), Amazon FSx for Windows File Server, Amazon FSx for Lustre, Application Load Balancer, and Amazon Virtual Private Cloud. Single-Zone RDS is on the near-term roadmap, and other services will come later based on customer demand. Applications running in a Local Zone can also make use of services in the parent region. Parent Region – As I mentioned earlier, the new Local Zone is a logical extension of the US West (Oregon) region, and is managed by the “control plane” in the region. API calls, CLI commands, and the AWS Management Console should use “us-west-2” or US West (Oregon). AWS – Other parts of AWS will continue to work as expected after you start to use this Local Zone. Your IAM resources, CloudFormation templates, and Organizations are still relevant and applicable, as are your tools and (perhaps most important) your investment in AWS training. Pricing & Billing – Instances and other AWS resources in Local Zones will have different prices than in the parent region. Billing reports will include a prefix that is specific to a group of Local Zones that share a physical location. EC2 instances are available in On Demand & Spot form, and you can also purchase Savings Plans. Using a Local Zone The first Local Zone is available today, and you can request access here: In early 2020, you will be able opt in using the console, CLI, or by API call. After opting in, I can list my AZs and see that the Local Zone is included: Then I create a new VPC subnet for the Local Zone. This gives me transparent, seamless connectivity between the parent zone in Oregon and the Local Zone in Los Angeles, all within the VPC: I can create EBS volumes: They are, as usual, ready within seconds: I can also see and use the Local Zone from within the AWS Management Console: I can also use the AWS APIs, CloudFormation templates, and so forth. Thinking Ahead Local Zones give you even more architectural flexibility. You can think big, and you can think different! You now have the components, tools, and services at your fingertips to build applications that make use of any conceivable combination of legacy on-premises resources, modern on-premises cloud resources via AWS Outposts, resources in a Local Zone, and resources in one or more AWS regions. In the fullness of time (as Andy Jassy often says), there could very well be more than one Local Zone in any given geographic area. In 2020, we will open a second one in Los Angeles (us-west-2-lax-1b), and are giving consideration to other locations. We would love to get your advice on locations, so feel free to leave me a comment or two! Now Available The Local Zone in Los Angeles is available now and you can start using it today. Learn more about Local Zones. — Jeff;  

Amazon SageMaker Studio: The First Fully Integrated Development Environment For Machine Learning

Today, we’re extremely happy to launch Amazon SageMaker Studio, the first fully integrated development environment (IDE) for machine learning (ML). We have come a long way since we launched Amazon SageMaker in 2017, and it is shown in the growing number of customers using the service. However, the ML development workflow is still very iterative, and is challenging for developers to manage due to the relative immaturity of ML tooling. Many of the tools which developers take for granted when building traditional software (debuggers, project management, collaboration, monitoring, and so forth) have yet been invented for ML. For example, when trying a new algorithm or tweaking hyper parameters, developers and data scientists typically run hundreds and thousands of experiments on Amazon SageMaker, and they need to manage all this manually. Over time, it becomes much harder to track the best performing models, and to capitalize on lessons learned during the course of experimentation. Amazon SageMaker Studio unifies at last all the tools needed for ML development. Developers can write code, track experiments, visualize data, and perform debugging and monitoring all within a single, integrated visual interface, which significantly boosts developer productivity. In addition, since all these steps of the ML workflow are tracked within the environment, developers can quickly move back and forth between steps, and also clone, tweak, and replay them. This gives developers the ability to make changes quickly, observe outcomes, and iterate faster, reducing the time to market for high quality ML solutions. Introducing Amazon SageMaker Studio Amazon SageMaker Studio lets you manage your entire ML workflow through a single pane of glass. Let me give you the whirlwind tour! With Amazon SageMaker Notebooks (currently in preview), you can enjoy an enhanced notebook experience that lets you easily create and share Jupyter notebooks. Without having to manage any infrastructure, you can also quickly switch from one hardware configuration to another. With Amazon SageMaker Experiments, you can organize, track and compare thousands of ML jobs: these can be training jobs, or data processing and model evaluation jobs run with Amazon SageMaker Processing. With Amazon SageMaker Debugger, you can debug and analyze complex training issues, and receive alerts. It automatically introspects your models, collects debugging data, and analyzes it to provide real-time alerts and advice on ways to optimize your training times, and improve model quality. All information is visible as your models are training. With Amazon SageMaker Model Monitor, you can detect quality deviations for deployed models, and receive alerts. You can easily visualize issues like data drift that could be affecting your models. No code needed: all it takes is a few clicks. With Amazon SageMaker Autopilot, you can build models automatically with full control and visibility. Algorithm selection, data preprocessing, and model tuning are taken care automatically, as well as all infrastructure. Thanks to these new capabilities, Amazon SageMaker now covers the complete ML workflow to build, train, and deploy machine learning models, quickly and at any scale. These services mentioned above, except for Amazon SageMaker Notebooks, are covered in individual blog posts (see below) showing you how to quickly get started, so keep your eyes peeled and read on! Amazon SageMaker Debugger Amazon SageMaker Model Monitor Amazon SageMaker Autopilot Amazon SageMaker Experiments Now Available! Amazon SageMaker Studio is available today in US East (Ohio). Give it a try, and please send us feedback either in the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. - Julien

Amazon SageMaker Debugger – Debug Your Machine Learning Models

Today, we’re extremely happy to announce Amazon SageMaker Debugger, a new capability of Amazon SageMaker that automatically identifies complex issues developing in machine learning (ML) training jobs. Building and training ML models is a mix of science and craft (some would even say witchcraft). From collecting and preparing data sets to experimenting with different algorithms to figuring out optimal training parameters (the dreaded hyperparameters), ML practitioners need to clear quite a few hurdles to deliver high-performance models. This is the very reason why be built Amazon SageMaker : a modular, fully managed service that simplifies and speeds up ML workflows. As I keep finding out, ML seems to be one of Mr. Murphy’s favorite hangouts, and everything that may possibly go wrong often does! In particular, many obscure issues can happen during the training process, preventing your model from correctly extracting and learning patterns present in your data set. I’m not talking about software bugs in ML libraries (although they do happen too): most failed training jobs are caused by an inappropriate initialization of parameters, a poor combination of hyperparameters, a design issue in your own code, etc. To make things worse, these issues are rarely visible immediately: they grow over time, slowly but surely ruining your training process, and yielding low accuracy models. Let’s face it, even if you’re a bonafide expert, it’s devilishly difficult and time-consuming to identify them and hunt them down, which is why we built Amazon SageMaker Debugger. Let me tell you more. Introducing Amazon SageMaker Debugger In your existing training code for TensorFlow, Keras, Apache MXNet, PyTorch and XGBoost, you can use the new SageMaker Debugger SDK to save internal model state at periodic intervals; as you can guess, it will be stored in Amazon Simple Storage Service (S3). This state is composed of: The parameters being learned by the model, e.g. weights and biases for neural networks, The changes applied to these parameters by the optimizer, aka gradients, The optimization parameters themselves, Scalar values, e.g. accuracies and losses, The output of each layer, Etc. Each specific set of values – say, the sequence of gradients flowing over time through a specific neural network layer – is saved independently, and referred to as a tensor. Tensors are organized in collections (weights, gradients, etc.), and you can decide which ones you want to save during training. Then, using the SageMaker SDK and its estimators, you configure your training job as usual, passing additional parameters defining the rules you want SageMaker Debugger to apply. A rule is a piece of Python code that analyses tensors for the model in training, looking for specific unwanted conditions. Pre-defined rules are available for common problems such as exploding/vanishing tensors (parameters reaching NaN or zero values), exploding/vanishing gradients, loss not changing, and more. Of course, you can also write your own rules. Once the SageMaker estimator is configured, you can launch the training job. Immediately, it fires up a debug job for each rule that you configured, and they start inspecting available tensors. If a debug job detects a problem, it stops and logs additional information. A CloudWatch Events event is also sent, should you want to trigger additional automated steps. So now you know that your deep learning job suffers from say, vanishing gradients. With a little brainstorming and experience, you’ll know where to look: maybe the neural network is too deep? Maybe your learning rate is too small? As the internal state has been saved to S3, you can now use the SageMaker Debugger SDK to explore the evolution of tensors over time, confirm your hypothesis and fix the root cause. Let’s see SageMaker Debugger in action with a quick demo. Debugging Machine Learning Models with Amazon SageMaker Debugger At the core of SageMaker Debugger is the ability to capture tensors during training. This requires a little bit of instrumentation in your training code, in order to select the tensor collections you want to save, the frequency at which you want to save them, and whether you want to save the values themselves or a reduction (mean, average, etc.). For this purpose, the SageMaker Debugger SDK provides simple APIs for each framework that it supports. Let me show you how this works with a simple TensorFlow script, trying to fit a 2-dimension linear regression model. Of course, you’ll find more examples in this Github repository. Let’s take a look at the initial code: import argparse import numpy as np import tensorflow as tf import random parser = argparse.ArgumentParser() parser.add_argument('--model_dir', type=str, help="S3 path for the model") parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001) parser.add_argument('--steps', type=int, help="Number of steps to run", default=100) parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0) args = parser.parse_args() with tf.name_scope('initialize'): # 2-dimensional input sample x = tf.placeholder(shape=(None, 2), dtype=tf.float32) # Initial weights: [10, 10] w = tf.Variable(initial_value=[[10.], [10.]], name='weight1') # True weights, i.e. the ones we're trying to learn w0 = [[1], [1.]] with tf.name_scope('multiply'): # Compute true label y = tf.matmul(x, w0) # Compute "predicted" label y_hat = tf.matmul(x, w) with tf.name_scope('loss'): # Compute loss loss = tf.reduce_mean((y_hat - y) ** 2, name="loss") optimizer = tf.train.AdamOptimizer( optimizer_op = optimizer.minimize(loss) with tf.Session() as sess: for i in range(args.steps): x_ = np.random.random((10, 2)) * args.scale _loss, opt =[loss, optimizer_op], {x: x_}) print (f'Step={i}, Loss={_loss}') Let’s train this script using the TensorFlow Estimator. I’m using SageMaker local mode, which is a great way to quickly iterate on experimental code. bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000} estimator = TensorFlow( role=sagemaker.get_execution_role(), base_job_name='debugger-simple-demo', train_instance_count=1, train_instance_type='local', entry_point='', framework_version='1.13.1', py_version='py3', script_mode=True, hyperparameters=bad_hyperparameters) Looking at the training log, things did not go well. Step=0, Loss=7.883463958023267e+23 algo-1-hrvqg_1 | Step=1, Loss=9.502028841062608e+23 algo-1-hrvqg_1 | Step=2, Loss=nan algo-1-hrvqg_1 | Step=3, Loss=nan algo-1-hrvqg_1 | Step=4, Loss=nan algo-1-hrvqg_1 | Step=5, Loss=nan algo-1-hrvqg_1 | Step=6, Loss=nan algo-1-hrvqg_1 | Step=7, Loss=nan algo-1-hrvqg_1 | Step=8, Loss=nan algo-1-hrvqg_1 | Step=9, Loss=nan Loss does not decrease at all, and even goes to infinity… This looks like an exploding tensor problem, which is one of the built-in rules defined in SageMaker Debugger. Let’s get to work. Using the Amazon SageMaker Debugger SDK In order to capture tensors, I need to instrument the training script with: A SaveConfig object specifying the frequency at which tensors should be saved, A SessionHook object attached to the TensorFlow session, putting everything together and saving required tensors during training, An (optional) ReductionConfig object, listing tensor reductions that should be saved instead of full tensors, An (optional) optimizer wrapper to capture gradients. Here’s the updated code, with extra command line arguments for SageMaker Debugger parameters. import argparse import numpy as np import tensorflow as tf import random import smdebug.tensorflow as smd parser = argparse.ArgumentParser() parser.add_argument('--model_dir', type=str, help="S3 path for the model") parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001 ) parser.add_argument('--steps', type=int, help="Number of steps to run", default=100 ) parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0 ) parser.add_argument('--debug_path', type=str, default='/opt/ml/output/tensors') parser.add_argument('--debug_frequency', type=int, help="How often to save tensor data", default=10) feature_parser = parser.add_mutually_exclusive_group(required=False) feature_parser.add_argument('--reductions', dest='reductions', action='store_true', help="save reductions of tensors instead of saving full tensors") feature_parser.add_argument('--no_reductions', dest='reductions', action='store_false', help="save full tensors") args = parser.parse_args() args = parser.parse_args() reduc = smd.ReductionConfig(reductions=['mean'], abs_reductions=['max'], norms=['l1']) if args.reductions else None hook = smd.SessionHook(out_dir=args.debug_path, include_collections=['weights', 'gradients', 'losses'], save_config=smd.SaveConfig(save_interval=args.debug_frequency), reduction_config=reduc) with tf.name_scope('initialize'): # 2-dimensional input sample x = tf.placeholder(shape=(None, 2), dtype=tf.float32) # Initial weights: [10, 10] w = tf.Variable(initial_value=[[10.], [10.]], name='weight1') # True weights, i.e. the ones we're trying to learn w0 = [[1], [1.]] with tf.name_scope('multiply'): # Compute true label y = tf.matmul(x, w0) # Compute "predicted" label y_hat = tf.matmul(x, w) with tf.name_scope('loss'): # Compute loss loss = tf.reduce_mean((y_hat - y) ** 2, name="loss") hook.add_to_collection('losses', loss) optimizer = tf.train.AdamOptimizer( optimizer = hook.wrap_optimizer(optimizer) optimizer_op = optimizer.minimize(loss) hook.set_mode(smd.modes.TRAIN) with tf.train.MonitoredSession(hooks=[hook]) as sess: for i in range(args.steps): x_ = np.random.random((10, 2)) * args.scale _loss, opt =[loss, optimizer_op], {x: x_}) print (f'Step={i}, Loss={_loss}') I also need to modify the TensorFlow Estimator, to use the SageMaker Debugger-enabled training container and to pass additional parameters. bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1} from sagemaker.debugger import Rule, rule_configs estimator = TensorFlow( role=sagemaker.get_execution_role(), base_job_name='debugger-simple-demo', train_instance_count=1, train_instance_type='ml.c5.2xlarge', image_name=cpu_docker_image_name, entry_point='', framework_version='1.15', py_version='py3', script_mode=True, hyperparameters=bad_hyperparameters, rules = [Rule.sagemaker(rule_configs.exploding_tensor())] ) 2019-11-27 10:42:02 Starting - Starting the training job... 2019-11-27 10:42:25 Starting - Launching requested ML instances ********* Debugger Rule Status ********* * * ExplodingTensor: InProgress * **************************************** Two jobs are running: the actual training job, and a debug job checking for the rule defined in the Estimator. Quickly, the debug job fails! Describing the training job, I can get more information on what happened. description = client.describe_training_job(TrainingJobName=job_name) print(description['DebugRuleEvaluationStatuses'][0]['RuleConfigurationName']) print(description['DebugRuleEvaluationStatuses'][0]['RuleEvaluationStatus']) ExplodingTensor IssuesFound Let’s take a look at the saved tensors. Exploring Tensors I can easily grab the tensors saved in S3 during the training process. s3_output_path = description["DebugConfig"]["DebugHookConfig"]["S3OutputPath"] trial = create_trial(s3_output_path) Let’s list available tensors. trial.tensors() ['loss/loss:0', 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0', 'initialize/weight1:0'] All values are numpy arrays, and I can easily iterate over them. tensor = 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0' for s in list(trial.tensor(tensor).steps()): print("Value: ", trial.tensor(tensor).step(s).value) Value: [[1.1508383e+23] [1.0809098e+23]] Value: [[1.0278440e+23] [1.1347468e+23]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] As tensor names include the TensorFlow scope defined in the training code, I can easily see that something is wrong with my matrix multiplication. # Compute true label y = tf.matmul(x, w0) # Compute "predicted" label y_hat = tf.matmul(x, w) Digging a little deeper, the x input is modified by a scaling parameter, which I set to 100000000000 in the Estimator. The learning rate doesn’t look sane either. Bingo! x_ = np.random.random((10, 2)) * args.scale bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1} As you probably knew all along, setting these hyperparameters to more reasonable values will fix the training issue. Now Available! We believe Amazon SageMaker Debugger will help you find and solve training issues quicker, so it’s now your turn to go bug hunting. Amazon SageMaker Debugger is available today in all commercial regions where Amazon SageMaker is available. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. - Julien    

Amazon SageMaker Model Monitor – Fully Managed Automatic Monitoring For Your Machine Learning Models

Today, we’re extremely happy to announce Amazon SageMaker Model Monitor, a new capability of Amazon SageMaker that automatically monitors machine learning (ML) models in production, and alerts you when data quality issues appear. The first thing I learned when I started working with data is that there is no such thing as paying too much attention to data quality. Raise your hand if you’ve spent hours hunting down problems caused by unexpected NULL values or by exotic character encodings that somehow ended up in one of your databases. As models are literally built from large amounts of data, it’s easy to see why ML practitioners spend so much time caring for their data sets. In particular, they make sure that data samples in the training set (used to train the model) and in the validation set (used to measure its accuracy) have the same statistical properties. There be monsters! Although you have full control over your experimental data sets, the same can’t be said for real-life data that your models will receive. Of course, that data will be unclean, but a more worrisome problem is “data drift”, i.e. a gradual shift in the very statistical nature of the data you receive. Minimum and maximum values, mean, average, variance, and more: all these are key attributes that shape assumptions and decisions made during the training of a model. Intuitively, you can surely feel that any significant change in these values would impact the accuracy of predictions: imagine a loan application predicting higher amounts because input features are drifting or even missing! Detecting these conditions is pretty difficult: you would need to capture data received by your models, run all kinds of statistical analysis to compare that data to the training set, define rules to detect drift, send alerts if it happens… and do it all over again each time you update your models. Expert ML practitioners certainly know how to build these complex tools, but at the great expense of time and resources. Undifferentiated heavy lifting strikes again… To help all customers focus on creating value instead, we built Amazon SageMaker Model Monitor. Let me tell you more. Introducing Amazon SageMaker Model Monitor A typical monitoring session goes like this. You first start from a SageMaker endpoint to monitor, either an existing one, or a new one created specifically for monitoring purposes. You can use SageMaker Model Monitor on any endpoint, whether the model was trained with a built-in algorithm, a built-in framework, or your own container. Using the SageMaker SDK, you can capture a configurable fraction of the data sent to the endpoint (you can also capture predictions if you’d like), and store it in one of your Amazon Simple Storage Service (S3) buckets. Captured data is enriched with metadata (content type, timestamp, etc.), and you can secure and access it just like any S3 object. Then, you create a baseline from the data set that was used to train the model deployed on the endpoint (of course, you can reuse an existing baseline, too). This will fire up a Amazon SageMaker Processing job where SageMaker Model Monitor will: Infer a schema for the input data, i.e. type and completeness information for each feature. You should review it, and update it if needed. For pre-built containers only, compute feature statistics using Deequ, an open source tool based on Apache Spark that is developed and used at Amazon (blog post and research paper). These statistics include KLL sketches, an advanced technique to compute accurate quantiles on streams of data, that we recently contributed to Deequ. Using these artifacts, the next step is to launch a monitoring schedule, to let SageMaker Model Monitor inspect collected data and prediction quality. Whether you’re using a built-in or custom container, a number of built-in rules are applied, and reports are periodically pushed to S3. The reports contain statistics and schema information on the data received during the latest time frame, as well as any violation that was detected. Last but not least, SageMaker Model Monitor emits per-feature metrics to Amazon CloudWatch, which you can use to set up dashboards and alerts. The summary metrics from CloudWatch are also visible in Amazon SageMaker Studio, and of course all statistics, monitoring results and data collected can be viewed and further analyzed in a notebook. For more information and an example on how to use SageMaker Model Monitor using AWS CloudFormation, refer to the developer guide. Now, let’s do a demo, using a churn prediction model trained with the built-in XGBoost algorithm. Enabling Data Capture The first step is to create an endpoint configuration to enable data capture. Here, I decide to capture 100% of incoming data, as well as model output (i.e. predictions). I’m also passing the content types for CSV and JSON data. data_capture_configuration = { "EnableCapture": True, "InitialSamplingPercentage": 100, "DestinationS3Uri": s3_capture_upload_path, "CaptureOptions": [ { "CaptureMode": "Output" }, { "CaptureMode": "Input" } ], "CaptureContentTypeHeader": { "CsvContentTypes": ["text/csv"], "JsonContentTypes": ["application/json"] } Next, I create the endpoint using the usual CreateEndpoint API. create_endpoint_config_response = sm_client.create_endpoint_config( EndpointConfigName = endpoint_config_name, ProductionVariants=[{ 'InstanceType':'ml.m5.xlarge', 'InitialInstanceCount':1, 'InitialVariantWeight':1, 'ModelName':model_name, 'VariantName':'AllTrafficVariant' }], DataCaptureConfig = data_capture_configuration) On an existing endpoint, I would have used the UpdateEndpoint API to seamlessly update the endpoint configuration. After invoking the endpoint repeatedly, I can see some captured data in S3 (output was edited for clarity). $ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/datacapture/DEMO-xgb-churn-pred-model-monitor-2019-11-22-07-59-33/ AllTrafficVariant/2019/11/22/08/24-40-519-9a9273ca-09c2-45d3-96ab-fc7be2402d43.jsonl AllTrafficVariant/2019/11/22/08/25-42-243-3e1c653b-8809-4a6b-9d51-69ada40bc809.jsonl Here’s a line from one of these files. "endpointInput":{ "observedContentType":"text/csv", "mode":"INPUT", "data":"132,25,113.2,96,269.9,107,229.1,87,7.1,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1", "encoding":"CSV" }, "endpointOutput":{ "observedContentType":"text/csv; charset=utf-8", "mode":"OUTPUT", "data":"0.01076381653547287", "encoding":"CSV"} }, "eventMetadata":{ "eventId":"6ece5c74-7497-43f1-a263-4833557ffd63", "inferenceTime":"2019-11-22T08:24:40Z"}, "eventVersion":"0"} Pretty much what I expected. Now, let’s create a baseline for this model. Creating A Monitoring Baseline This is a very simple step: pass the location of the baseline data set, and the location where results should be stored. from processingjob_wrapper import ProcessingJob processing_job = ProcessingJob(sm_client, role). create(job_name, baseline_data_uri, baseline_results_uri) Once that job is complete, I can see two new objects in S3: one for statistics, and one for constraints. aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/baselining/results/ constraints.json statistics.json The constraints.json file tells me about the inferred schema for the training data set (don’t forget to check it’s accurate). Each feature is typed, and I also get information on whether a feature is always present or not (1.0 means 100% here). Here are the first few lines. { "version" : 0.0, "features" : [ { "name" : "Churn", "inferred_type" : "Integral", "completeness" : 1.0 }, { "name" : "Account Length", "inferred_type" : "Integral", "completeness" : 1.0 }, { "name" : "VMail Message", "inferred_type" : "Integral", "completeness" : 1.0 }, { "name" : "Day Mins", "inferred_type" : "Fractional", "completeness" : 1.0 }, { "name" : "Day Calls", "inferred_type" : "Integral", "completeness" : 1.0 At the end of that file, I can see configuration information for CloudWatch monitoring: turn it on or off, set the drift threshold, etc. "monitoring_config" : { "evaluate_constraints" : "Enabled", "emit_metrics" : "Enabled", "distribution_constraints" : { "enable_comparisons" : true, "min_domain_mass" : 1.0, "comparison_threshold" : 1.0 } } The statistics.json file shows different statistics for each feature (mean, average, quantiles, etc.), as well as unique values received by the endpoint. Here’s an example. "name" : "Day Mins", "inferred_type" : "Fractional", "numerical_statistics" : { "common" : { "num_present" : 2333, "num_missing" : 0 }, "mean" : 180.22648949849963, "sum" : 420468.3999999996, "std_dev" : 53.987178959901556, "min" : 0.0, "max" : 350.8, "distribution" : { "kll" : { "buckets" : [ { "lower_bound" : 0.0, "upper_bound" : 35.08, "count" : 14.0 }, { "lower_bound" : 35.08, "upper_bound" : 70.16, "count" : 48.0 }, { "lower_bound" : 70.16, "upper_bound" : 105.24000000000001, "count" : 130.0 }, { "lower_bound" : 105.24000000000001, "upper_bound" : 140.32, "count" : 318.0 }, { "lower_bound" : 140.32, "upper_bound" : 175.4, "count" : 565.0 }, { "lower_bound" : 175.4, "upper_bound" : 210.48000000000002, "count" : 587.0 }, { "lower_bound" : 210.48000000000002, "upper_bound" : 245.56, "count" : 423.0 }, { "lower_bound" : 245.56, "upper_bound" : 280.64, "count" : 180.0 }, { "lower_bound" : 280.64, "upper_bound" : 315.72, "count" : 58.0 }, { "lower_bound" : 315.72, "upper_bound" : 350.8, "count" : 10.0 } ], "sketch" : { "parameters" : { "c" : 0.64, "k" : 2048.0 }, "data" : [ [ 178.1, 160.3, 197.1, 105.2, 283.1, 113.6, 232.1, 212.7, 73.3, 176.9, 161.9, 128.6, 190.5, 223.2, 157.9, 173.1, 273.5, 275.8, 119.2, 174.6, 133.3, 145.0, 150.6, 220.2, 109.7, 155.4, 172.0, 235.6, 218.5, 92.7, 90.7, 162.3, 146.5, 210.1, 214.4, 194.4, 237.3, 255.9, 197.9, 200.2, 120, ... Now, let’s start monitoring our endpoint. Monitoring An Endpoint Again, one API call is all that it takes: I simply create a monitoring schedule for my endpoint, passing the constraints and statistics file for the baseline data set. Optionally, I could also pass preprocessing and postprocessing functions, should I want to tweak data and predictions. ms = MonitoringSchedule(sm_client, role) schedule = ms.create( mon_schedule_name, endpoint_name, s3_report_path, # record_preprocessor_source_uri=s3_code_preprocessor_uri, # post_analytics_source_uri=s3_code_postprocessor_uri, baseline_statistics_uri=baseline_results_uri + '/statistics.json', baseline_constraints_uri=baseline_results_uri+ '/constraints.json' ) Then, I start sending bogus data to the endpoint, i.e. samples constructed from random values, and I wait for SageMaker Model Monitor to start generating reports. The suspense is killing me! Inspecting Reports Quickly, I see that reports are available in S3. mon_executions = sm_client.list_monitoring_executions(MonitoringScheduleName=mon_schedule_name, MaxResults=3) for execution_summary in mon_executions['MonitoringExecutionSummaries']: print("ProcessingJob: {}".format(execution_summary['ProcessingJobArn'].split('/')[1])) print('MonitoringExecutionStatus: {} \n'.format(execution_summary['MonitoringExecutionStatus'])) ProcessingJob: model-monitoring-201911221050-df2c7fc4 MonitoringExecutionStatus: Completed ProcessingJob: model-monitoring-201911221040-3a738dd7 MonitoringExecutionStatus: Completed ProcessingJob: model-monitoring-201911221030-83f15fb9 MonitoringExecutionStatus: Completed Let’s find the reports for one of these monitoring jobs. desc_analytics_job_result=sm_client.describe_processing_job(ProcessingJobName=job_name) report_uri=desc_analytics_job_result['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri'] print('Report Uri: {}'.format(report_uri)) Report Uri: s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209 Ok, so what do we have here? aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209/ constraint_violations.json constraints.json statistics.json As you would expect, the constraints.json and statistics.json contain schema and statistics information on the data samples processed by the monitoring job. Let’s open directly the third one, constraints_violations.json! violations" : [ { "feature_name" : "State_AL", "constraint_check_type" : "data_type_check", "description" : "Value: 0.8 does not meet the constraint requirement! " }, { "feature_name" : "Eve Mins", "constraint_check_type" : "baseline_drift_check", "description" : "Numerical distance: 0.2711598746081505 exceeds numerical threshold: 0" }, { "feature_name" : "CustServ Calls", "constraint_check_type" : "baseline_drift_check", "description" : "Numerical distance: 0.6470588235294117 exceeds numerical threshold: 0" } Oops! It looks like I’ve been assigning floating point values to integer features: surely that’s not going to work too well! Some features are also exhibiting drift, that’s not good either. Maybe something is wrong with my data ingestion process, or maybe the distribution of data has actually changed, and I need to retrain the model. As all this information is available as CloudWatch metrics, I could define thresholds, set alarms and even trigger new training jobs automatically. Now Available! As you can see, Amazon SageMaker Model Monitor is easy to set up, and helps you quickly know about quality issues in your ML models. Now it’s your turn: you can start using Amazon SageMaker Model Monitor today in all commercial regions where Amazon SageMaker is available. This capability is also integrated in Amazon SageMaker Studio, our workbench for ML projects. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. - Julien

Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation

Today, we’re extremely happy to launch Amazon SageMaker Processing, a new capability of Amazon SageMaker that lets you easily run your preprocessing, postprocessing and model evaluation workloads on fully managed infrastructure. Training an accurate machine learning (ML) model requires many different steps, but none is potentially more important than preprocessing your data set, e.g.: Converting the data set to the input format expected by the ML algorithm you’re using, Transforming existing features to a more expressive representation, such as one-hot encoding categorical features, Rescaling or normalizing numerical features, Engineering high level features, e.g. replacing mailing addresses with GPS coordinates, Cleaning and tokenizing text for natural language processing applications, And more! These tasks involve running bespoke scripts on your data set, (beneath a moonless sky, I’m told) and saving the processed version for later use by your training jobs. As you can guess, running them manually or having to build and scale automation tools is not an exciting prospect for ML teams. The same could be said about postprocessing jobs (filtering, collating, etc.) and model evaluation jobs (scoring models against different test sets). Solving this problem is why we built Amazon SageMaker Processing. Let me tell you more. Introducing Amazon SageMaker Processing Amazon SageMaker Processing introduces a new Python SDK that lets data scientists and ML engineers easily run preprocessing, postprocessing and model evaluation workloads on Amazon SageMaker. This SDK uses SageMaker’s built-in container for scikit-learn, possibly the most popular library one for data set transformation. If you need something else, you also have the ability to use your own Docker images without having to conform to any Docker image specification: this gives you maximum flexibility in running any code you want, whether on SageMaker Processing, on AWS container services like Amazon ECS and Amazon Elastic Kubernetes Service, or even on premise. How about a quick demo with scikit-learn? Then, I’ll briefly discuss using your own container. Of course, you’ll find complete examples on Github. Preprocessing Data With The Built-In Scikit-Learn Container Here’s how to use the SageMaker Processing SDK to run your scikit-learn jobs. First, let’s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements. from sagemaker.sklearn.processing import SKLearnProcessor sklearn_processor = SKLearnProcessor(framework_version='0.20.0', role=role, instance_count=1, instance_type='ml.m5.xlarge') Then, we can run our preprocessing script (more on this fellow in a minute) like so: The data set (dataset.csv) is automatically copied inside the container under the destination directory (/input). We could add additional inputs if needed. This is where the Python script ( reads it. Optionally, we could pass command line arguments to the script. It preprocesses it, splits it three ways, and saves the files inside the container under /opt/ml/processing/output/train, /opt/ml/processing/output/validation, and /opt/ml/processing/output/test. Once the job completes, all outputs are automatically copied to your default SageMaker bucket in S3. from sagemaker.processing import ProcessingInput, ProcessingOutput code='', # arguments = ['arg1', 'arg2'], inputs=[ProcessingInput( source='dataset.csv', destination='/opt/ml/processing/input')], outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'), ProcessingOutput(source='/opt/ml/processing/output/validation'), ProcessingOutput(source='/opt/ml/processing/output/test')] ) That’s it! Let’s put everything together by looking at the skeleton of the preprocessing script. import pandas as pd from sklearn.model_selection import train_test_split # Read data locally df = pd.read_csv('/opt/ml/processing/input/dataset.csv') # Preprocess the data set downsampled = apply_mad_data_science_skills(df) # Split data set into training, validation, and test train, test = train_test_split(downsampled, test_size=0.2) train, validation = train_test_split(train, test_size=0.2) # Create local output directories try: os.makedirs('/opt/ml/processing/output/train') os.makedirs('/opt/ml/processing/output/validation') os.makedirs('/opt/ml/processing/output/test') except: pass # Save data locally train.to_csv("/opt/ml/processing/output/train/train.csv") validation.to_csv("/opt/ml/processing/output/validation/validation.csv") test.to_csv("/opt/ml/processing/output/test/test.csv") print('Finished running processing job') A quick look to the S3 bucket confirms that files have been successfully processed and saved. Now I could use them directly as input for a SageMaker training job. $ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker-scikit-learn-2019-11-20-13-57-17-805/output 2019-11-20 15:03:22 19967 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/test.csv 2019-11-20 15:03:22 64998 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/train.csv 2019-11-20 15:03:22 18058 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/validation.csv Now what about using your own container? Processing Data With Your Own Container Let’s say you’d like to preprocess text data with the popular spaCy library. Here’s how you could define a vanilla Docker container for it. FROM python:3.7-slim-buster # Install spaCy, pandas, and an english language model for spaCy. RUN pip3 install spacy==2.2.2 && pip3 install pandas==0.25.3 RUN python3 -m spacy download en_core_web_md # Make sure python doesn't buffer stdout so we get logs ASAP. ENV PYTHONUNBUFFERED=TRUE ENTRYPOINT ["python3"] Then, you would build the Docker container, test it locally, and push it to Amazon Elastic Container Registry, our managed Docker registry service. The next step would be to configure a processing job using the ScriptProcessor object, passing the name of the container you built and pushed. from sagemaker.processing import ScriptProcessor script_processor = ScriptProcessor(image_uri='', role=role, instance_count=1, instance_type='ml.m5.xlarge') Finally, you would run the job just like in the previous example.'', inputs=[ProcessingInput( source='dataset.csv', destination='/opt/ml/processing/input_data')], outputs=[ProcessingOutput(source='/opt/ml/processing/processed_data')], arguments=['tokenizer', 'lemmatizer', 'pos-tagger'] ) The rest of the process is exactly the same as above: copy the input(s) inside the container, copy the output(s) from the container to S3. Pretty simple, don’t you think? Again, I focused on preprocessing, but you can run similar jobs for postprocessing and model evaluation. Don’t forget to check out the examples in Github. Now Available! Amazon SageMaker Processing is available today in all commercial AWS Regions where Amazon SageMaker is available. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. — Julien

Amazon SageMaker Autopilot – Automatically Create High-Quality Machine Learning Models With Full Control And Visibility

Today, we’re extremely happy to launch Amazon SageMaker Autopilot to automatically create the best classification and regression machine learning models, while allowing full control and visibility. In 1959, Arthur Samuel defined machine learning as the ability for computers to learn without being explicitly programmed. In practice, this means finding an algorithm than can extract patterns from an existing data set, and use these patterns to build a predictive model that will generalize well to new data. Since then, lots of machine learning algorithms have been invented, giving scientists and engineers plenty of options to choose from, and helping them build amazing applications. However, this abundance of algorithms also creates a difficulty: which one should you pick? How can you reliably figure out which one will perform best on your specific business problem? In addition, machine learning algorithms usually have a long list of training parameters (also called hyperparameters) that need to be set “just right” if you want to squeeze every bit of extra accuracy from your models. To make things worse, algorithms also require data to be prepared and transformed in specific ways (aka feature engineering) for optimal learning… and you need to pick the best instance type. If you think this sounds like a lot of experimental, trial and error work, you’re absolutely right. Machine learning is definitely of mix of hard science and cooking recipes, making it difficult for non-experts to get good results quickly. What if you could rely on a fully managed service to solve that problem for you? Call an API and get the job done? Enter Amazon SageMaker Autopilot. Introducing Amazon SageMaker Autopilot Using a single API call, or a few clicks in Amazon SageMaker Studio, SageMaker Autopilot first inspects your data set, and runs a number of candidates to figure out the optimal combination of data preprocessing steps, machine learning algorithms and hyperparameters. Then, it uses this combination to train an Inference Pipeline, which you can easily deploy either on a real-time endpoint or for batch processing. As usual with Amazon SageMaker, all of this takes place on fully-managed infrastructure. Last but not least, SageMaker Autopilot also generate Python code showing you exactly how data was preprocessed: not only can you understand what SageMaker Autopilot did, you can also reuse that code for further manual tuning if you’re so inclined. As of today, SageMaker Autopilot supports: Input data in tabular format, with automatic data cleaning and preprocessing, Automatic algorithm selection for linear regression, binary classification, and multi-class classification, Automatic hyperparameter optimization, Distributed training, Automatic instance and cluster size selection. Let me show you how simple this is. Using AutoML with Amazon SageMaker Autopilot Let’s use this sample notebook as a starting point: it builds a binary classification model predicting if customers will accept or decline a marketing offer. Please take a few minutes to read it: as you will see, the business problem itself is easy to understand, and the data set is neither large or complicated. Yet, several non-intuitive preprocessing steps are required, and there’s also the delicate matter of picking an algorithm and its parameters… SageMaker Autopilot to the rescue! First, I grab a copy of the data set, and take a quick look at the first few lines. Then, I upload it in Amazon Simple Storage Service (S3) without any preprocessing whatsoever. sess.upload_data(path="automl-train.csv", key_prefix=prefix + "/input") 's3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-automl-dm/input/automl-train.csv' Now, let’s configure the AutoML job: Set the location of the data set, Select the target attribute that I want the model to predict: in this case, it’s the ‘y’ column showing if a customer accepted the offer or not, Set the location of training artifacts. input_data_config = [{ 'DataSource': { 'S3DataSource': { 'S3DataType': 'S3Prefix', 'S3Uri': 's3://{}/{}/input'.format(bucket,prefix) } }, 'TargetAttributeName': 'y' } ] output_data_config = { 'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix) } That’s it! Of course, SageMaker Autopilot has a number of options that will come in handy as you learn more about your data and your models, e.g.: Set the type of problem you want to train on: linear regression, binary classification, or multi-class classification. If you’re not sure, SageMaker Autopilot will figure it out automatically by analyzing the values of the target attribute. Use a specific metric for model evaluation. Define completion criteria: maximum running time, etc. One thing I don’t have to do is size the training cluster, as SageMaker Autopilot uses a heuristic based on data size and algorithm. Pretty cool! With configuration out of the way, I can fire up the job with the CreateAutoMl API. auto_ml_job_name = 'automl-dm-' + timestamp_suffix print('AutoMLJobName: ' + auto_ml_job_name) sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name, InputDataConfig=input_data_config, OutputDataConfig=output_data_config, RoleArn=role) AutoMLJobName: automl-dm-28-10-17-49 A job runs in four steps (you can use the DescribeAutoMlJob API to view them). Splitting the data set into train and validation sets, Analyzing data, in order to recommend pipelines that should be tried out on the data set, Feature engineering, where transformations are applied to the data set and to individual features,  Pipeline selection and hyperparameter tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm. Once the maximum number of candidates – or one of the stopping conditions – has been reached, the job is complete. I can get detailed information on all candidates using the ListCandidatesForAutoMlJob API , and also view them in the AWS console. candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates'] index = 1 for candidate in candidates: print (str(index) + " " + candidate['CandidateName'] + " " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value'])) index += 1 1 automl-dm-28-tuning-job-1-fabb8-001-f3b6dead 0.9186699986457825 2 automl-dm-28-tuning-job-1-fabb8-004-03a1ff8a 0.918304979801178 3 automl-dm-28-tuning-job-1-fabb8-003-c443509a 0.9181839823722839 4 automl-dm-28-tuning-job-1-ed07c-006-96f31fde 0.9158779978752136 5 automl-dm-28-tuning-job-1-ed07c-004-da2d99af 0.9130859971046448 6 automl-dm-28-tuning-job-1-ed07c-005-1e90fd67 0.9130859971046448 7 automl-dm-28-tuning-job-1-ed07c-008-4350b4fa 0.9119930267333984 8 automl-dm-28-tuning-job-1-ed07c-007-dae75982 0.9119930267333984 9 automl-dm-28-tuning-job-1-ed07c-009-c512379e 0.9119930267333984 10 automl-dm-28-tuning-job-1-ed07c-010-d905669f 0.8873512744903564 For now, I’m only interested in the best trial: 91.87% validation accuracy. Let’s deploy it to a SageMaker endpoint, just like we would deploy any model: Create a model, Create an endpoint configuration, Create the endpoint. model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'], ModelName=model_name, ExecutionRoleArn=role) ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name, ProductionVariants=[{'InstanceType':'ml.m5.2xlarge', 'InitialInstanceCount':1, 'ModelName':model_name, 'VariantName':variant_name}]) create_endpoint_response = sm.create_endpoint(EndpointName=ep_name, EndpointConfigName=epc_name) After a few minutes, the endpoint is live, and I can use it for prediction. SageMaker business as usual! Now, I bet you’re curious about how the model was built, and what the other candidates are. Let me show you. Full Visibility And Control with Amazon SageMaker Autopilot SageMaker Autopilot stores training artifacts in S3, including two auto-generated notebooks! job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name) job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation'] job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation'] print(job_data_notebook) print(job_candidate_notebook) s3://<PREFIX_REMOVED>/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb s3://<PREFIX_REMOVED>/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb The first one contains information about the data set. The second one contains full details on the SageMaker Autopilot job: candidates, data preprocessing steps, etc. All code is available, as well as ‘knobs’ you can change for further experimentation. As you can see, you have full control and visibility on how models are built. Now Available! I’m very excited about Amazon SageMaker Autopilot, because it’s making machine learning simpler and more accessible than ever. Whether you’re just beginning with machine learning, or whether you’re a seasoned practitioner, SageMaker Autopilot will help you build better models quicker using either one of these paths: Easy no-code path in Amazon SageMaker Studio, Easy code path with the SageMaker Autopilot SDK, In-depth path with candidate generation notebook. Now it’s your turn. You can start using SageMaker Autopilot today in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Canada (Central), South America (São Paulo), Europe (Ireland), Europe (London), Europe (Paris), Europe (Frankfurt), Middle East (Bahrain), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo). Please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. — Julien

Amazon SageMaker Experiments – Organize, Track And Compare Your Machine Learning Trainings

Today, we’re extremely happy to announce Amazon SageMaker Experiments, a new capability of Amazon SageMaker that lets you organize, track, compare and evaluate machine learning (ML) experiments and model versions. ML is a highly iterative process. During the course of a single project, data scientists and ML engineers routinely train thousands of different models in search of maximum accuracy. Indeed, the number of combinations for algorithms, data sets, and training parameters (aka hyperparameters) is infinite… and therein lies the proverbial challenge of finding a needle in a haystack. Tools like Automatic Model Tuning and Amazon SageMaker Autopilot help ML practitioners explore a large number of combinations automatically, and quickly zoom in on high-performance models. However, they further add to the explosive growth of training jobs. Over time, this creates a new difficulty for ML teams, as it becomes near-impossible to efficiently deal with hundreds of thousands of jobs: keeping track of metrics, grouping jobs by experiment, comparing jobs in the same experiment or across experiments, querying past jobs, etc. Of course, this can be solved by building, managing and scaling bespoke tools: however, doing so diverts valuable time and resources away from actual ML work. In the spirit of helping customers focus on ML and nothing else, we couldn’t leave this problem unsolved. Introducing Amazon SageMaker Experiments First, let’s define core concepts: A trial is a collection of training steps involved in a single training job. Training steps typically includes preprocessing, training, model evaluation, etc. A trial is also enriched with metadata for inputs (e.g. algorithm, parameters, data sets) and outputs (e.g. models, checkpoints, metrics). An experiment is simply a collection of trials, i.e. a group of related training jobs. The goal of SageMaker Experiments is to make it as simple as possible to create experiments, populate them with trials, and run analytics across trials and experiments. For this purpose, we introduce a new Python SDK containing logging and analytics APIs. Running your training jobs on SageMaker or SageMaker Autopilot, all you have to do is pass an extra parameter to the Estimator, defining the name of the experiment that this trial should be attached to. All inputs and outputs will be logged automatically. Once you’ve run your training jobs, the SageMaker Experiments SDK lets you load experiment and trial data in the popular pandas dataframe format. Pandas truly is the Swiss army knife of ML practitioners, and you’ll be able to perform any analysis that you may need. Go one step further by building cool visualizations with matplotlib, and you’ll be well on your way to taming that wild horde of training jobs! As you would expect, SageMaker Experiments is nicely integrated in Amazon SageMaker Studio. You can run complex queries to quickly find the past trial you’re looking for. You can also visualize real-time model leaderboards and metric charts. How about a quick demo? Logging Training Information With Amazon SageMaker Experiments Let’s start from a PyTorch script classifying images from the MNIST data set, using a simple two-layer convolution neural network (CNN). If I wanted to run a single job on SageMaker, I could use the PyTorch estimator like so: estimator = PyTorch( entry_point='', role=role, sagemaker_session=sess framework_version='1.1.0', train_instance_count=1, train_instance_type='ml.p3.2xlarge'){'training': inputs}) Instead, let’s say that I want to run multiple versions of the same script, changing only one of the hyperparameters (the number of convolution filters used by the two convolution layers, aka number of hidden channels) to measure its impact on model accuracy. Of course, we could run these jobs, grab the training logs, extract metrics with fancy text filtering, etc. Or we could use SageMaker Experiments! All I need to do is: Set up an experiment, Use a tracker to log experiment metadata, Create a trial for each training job I want to run, Run each training job, passing parameters for the experiment name and the trial name. First things first, let’s take care of the experiment. from smexperiments.experiment import Experiment mnist_experiment = Experiment.create( experiment_name="mnist-hand-written-digits-classification", description="Classification of mnist hand-written digits", sagemaker_boto_client=sm) Then, let’s add a few things that we want to keep track of, like the location of the data set and normalization values we applied to it. from smexperiments.tracker import Tracker with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker: tracker.log_input(name="mnist-dataset", media_type="s3/uri", value=inputs) tracker.log_parameters({ "normalization_mean": 0.1307, "normalization_std": 0.3081, }) Now let’s run a few jobs. I simply loop over the different values that I want to try, creating a new trial for each training job and adding the tracker information to it. for i, num_hidden_channel in enumerate([2, 5, 10, 20, 32]): trial_name = f"cnn-training-job-{num_hidden_channel}-hidden-channels-{int(time.time())}" cnn_trial = Trial.create( trial_name=trial_name, experiment_name=mnist_experiment.experiment_name, sagemaker_boto_client=sm, ) cnn_trial.add_trial_component(tracker.trial_component) Then, I configure the estimator, passing the value for the hyperparameter I’m interested in, and leaving the other ones as is. I’m also passing regular expressions to extract metrics from the training log. All these will push stored in the trial: in fact, all parameters (passed or default) will be.   estimator = PyTorch( entry_point='', role=role, sagemaker_session=sess, framework_version='1.1.0', train_instance_count=1, train_instance_type='ml.p3.2xlarge', hyperparameters={ 'hidden_channels': num_hidden_channels }, metric_definitions=[ {'Name':'train:loss', 'Regex':'Train Loss: (.*?);'}, {'Name':'test:loss', 'Regex':'Test Average loss: (.*?),'}, {'Name':'test:accuracy', 'Regex':'Test Accuracy: (.*?)%;'} ] ) Finally, I run the training job, associating it to the experiment and the trial. cnn_training_job_name = "cnn-training-job-{}".format(int(time.time())) inputs={'training': inputs}, job_name=cnn_training_job_name, experiment_config={ "ExperimentName": mnist_experiment.experiment_name, "TrialName": cnn_trial.trial_name, "TrialComponentDisplayName": "Training", } ) # end of loop Once all jobs are complete, I can run analytics. Let’s find out how we did. Analytics with Amazon SageMaker Experiments All information on an experiment can be easily exported to a Pandas DataFrame. from import ExperimentAnalytics trial_component_analytics = ExperimentAnalytics( sagemaker_session=sess, experiment_name=mnist_experiment.experiment_name ) analytic_table = trial_component_analytics.dataframe() If I want to drill down, I can specify additional parameters, e.g.: trial_component_analytics = ExperimentAnalytics( sagemaker_session=sess, experiment_name=mnist_experiment.experiment_name, sort_by="metrics.test:accuracy.max", sort_order="Descending", metric_names=['test:accuracy'], parameter_names=['hidden_channels', 'epochs', 'dropout', 'optimizer'] ) analytic_table = trial_component_analytics.dataframe() This builds a DataFrame where trials are sorted by decreasing test accuracy, and showing only some of the hyperparameters for each trial. for col in analytic_table.columns: print(col) TrialComponentName DisplayName SourceArn dropout epochs hidden_channels optimizer test:accuracy - Min test:accuracy - Max test:accuracy - Avg test:accuracy - StdDev test:accuracy - Last test:accuracy - Count From here on, your imagination is the limit. Pandas is the Swiss army knife of data analysis, and you’ll be able to compare trials and experiments in every possible way. Last but not least, thanks to the integration with Amazon SageMaker Studio, you’ll be able to visualize all this information in real-time with predefined widgets. To learn more about Amazon SageMaker Studio, visit this blog post. Now Available! I just scratched the surface of what you can do with Amazon SageMaker Experiments, and I believe it will help you tame the wild horde of jobs that you have to deal with everyday. The service is available today in all commercial AWS Regions where Amazon SageMaker is available. Give it a try and please send us feedback, either in the AWS forum for Amazon SageMaker, or through your usual AWS contacts. - Julien  

Now Available on Amazon SageMaker: The Deep Graph Library

Today, we’re happy to announce that the Deep Graph Library, an open source library built for easy implementation of graph neural networks, is now available on Amazon SageMaker. In recent years, Deep learning has taken the world by storm thanks to its uncanny ability to extract elaborate patterns from complex data, such as free-form text, images, or videos. However, lots of datasets don’t fit these categories and are better expressed with graphs. Intuitively, we can feel that traditional neural network architectures like convolution neural networks or recurrent neural networks are not a good fit for such datasets, and a new approach is required. A Primer On Graph Neural Networks Graph neural networks (GNN) are one of the most exciting developments in machine learning today, and these reference papers will get you started. GNNs are used to train predictive models on datasets such as: Social networks, where graphs show connections between related people, Recommender systems, where graphs show interactions between customers and items, Chemical analysis, where compounds are modeled as graphs of atoms and bonds, Cybersecurity, where graphs describe connections between source and destination IP addresses, And more! Most of the time, these datasets are extremely large and only partially labeled. Consider a fraud detection scenario where we would try to predict the likelihood that an individual is a fraudulent actor by analyzing his connections to known fraudsters. This problem could be defined as a semi-supervised learning task, where only a fraction of graph nodes would be labeled (‘fraudster’ or ‘legitimate’). This should be a better solution than trying to build a large hand-labeled dataset, and “linearizing” it to apply traditional machine learning algorithms. Working on these problems requires domain knowledge (retail, finance, chemistry, etc.), computer science knowledge (Python, deep learning, open source tools), and infrastructure knowledge (training, deploying, and scaling models). Very few people master all these skills, which is why tools like the Deep Graph Library and Amazon SageMaker are needed. Introducing The Deep Graph Library First released on Github in December 2018, the Deep Graph Library (DGL) is a Python open source library that helps researchers and scientists quickly build, train, and evaluate GNNs on their datasets. DGL is built on top of popular deep learning frameworks like PyTorch and Apache MXNet. If you know either one or these, you’ll find yourself quite at home. No matter which framework you use, you can get started easily thanks to these beginner-friendly examples. I also found the slides and code for the GTC 2019 workshop very useful. Once you’re done with toy examples, you can start exploring the collection of cutting edge models already implemented in DGL. For example, you can train a document classification model using a Graph Convolution Network (GCN) and the CORA dataset by simply running: $ python3 --dataset cora --gpu 0 --self-loop The code for all models is available for inspection and tweaking. These implementations have been carefully validated by AWS teams, who verified performance claims and made sure results could be reproduced. DGL also includes a collection of graph datasets, that you can easily download and experiment with. Of course, you can install and run DGL locally, but to make your life simpler, we added it to the Deep Learning Containers for PyTorch and Apache MXNet. This makes it easy to use DGL on Amazon SageMaker, in order to train and deploy models at any scale, without having to manage a single server. Let me show you how. Using DGL On Amazon SageMaker We added complete examples in the Github repository for SageMaker examples: one of them trains a simple GNN for molecular toxicity prediction using the Tox21 dataset. The problem we’re trying to solve is figuring it the potential toxicity of new chemical compounds with respect to 12 different targets (receptors inside biological cells, etc.). As you can imagine, this type of analysis is crucial when designing new drugs, and being able to quickly predict results without having to run in vitro experiments helps researchers focus their efforts on the most promising drug candidates. The dataset contains a little over 8,000 compounds: each one is modeled as a graph (atoms are vertices, atomic bonds are edges), and labeled 12 times (one label per target). Using a GNN, we’re going to build a multi-label binary classification model, allowing us to predict the potential toxicity of candidate molecules. In the training script, we can easily download the dataset from the DGL collection. from import Tox21 dataset = Tox21() Similarly, we can easily build a GNN classifier using the DGL model zoo. from dgl import model_zoo model = model_zoo.chem.GCNClassifier(     in_feats=args['n_input'],     gcn_hidden_feats=[args['n_hidden'] for _ in range(args['n_layers'])],     n_tasks=dataset.n_tasks,     classifier_hidden_feats=args['n_hidden']).to(args['device']) The rest of the code is mostly vanilla PyTorch, and you should be able to find your bearings if you’re familiar with this library. When it comes to running this code on Amazon SageMaker, all we have to do is use a SageMaker Estimator, passing the full name of our DGL container, and the name of the training script as a hyperparameter. estimator = sagemaker.estimator.Estimator(container,     role,     train_instance_count=1,     train_instance_type='ml.p3.2xlarge',     hyperparameters={'entrypoint': ''},     sagemaker_session=sess) code_location = sess.upload_data(CODE_PATH, bucket=bucket, key_prefix=custom_code_upload_location){'training-code': code_location}) <output removed> epoch 23/100, batch 48/49, loss 0.4684 epoch 23/100, batch 49/49, loss 0.5389 epoch 23/100, training roc-auc 0.9451 EarlyStopping counter: 10 out of 10 epoch 23/100, validation roc-auc 0.8375, best validation roc-auc 0.8495 Best validation score 0.8495 Test score 0.8273 2019-11-21 14:11:03 Uploading - Uploading generated training model 2019-11-21 14:11:03 Completed - Training job completed Training seconds: 209 Billable seconds: 209 Now, we could grab the trained model in S3, and use it to predict toxicity for large number of compounds, without having to run actual experiments. Fascinating stuff! Now Available! You can start using DGL on Amazon SageMaker today. Give it a try, and please send us feedback in the DGL forum, in the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. – Julien  

New – Amazon Managed Apache Cassandra Service (MCS)

Managing databases at scale is never easy. One of the options to store, retrieve, and manage large amounts of structured data, including key-value and tabular formats, is Apache Cassandra. With Cassandra, you can use the expressive Cassandra Query Language (CQL) to build applications quickly. However, managing large Cassandra clusters can be difficult and takes a lot of time. You need specialized expertise to set up, configure, and maintain the underlying infrastructure, and have a deep understanding of the entire application stack, including the Apache Cassandra open source software. You need to add or remove nodes manually, rebalancing partitions, and doing so while keeping your application available with the required performance. Talking with customers, we found out that they often keep their clusters scaled up for peak load because scaling down is complex. To keep your Cassandra cluster updated, you have to do it node by node. It’s hard to backup and restore a cluster if something goes wrong during an update, and you may end up skipping patches or running an outdated version. Introducing Amazon Managed Cassandra Service Today, we are launching in open preview Amazon Managed Apache Cassandra Service (MCS), a scalable, highly available, and managed Apache Cassandra-compatible database service. Amazon MCS is serverless, so you pay for only the resources you use and the service automatically scales tables up and down in response to application traffic. You can build applications that serve thousands of requests per second with virtually unlimited throughput and storage. With Amazon MCS, you can run your Cassandra workloads on AWS using the same Cassandra application code and developer tools that you use today. Amazon MCS implements the Apache Cassandra version 3.11 CQL API, allowing you to use the code and drivers that you already have in your applications. Updating your application is as easy as changing the endpoint to the one in the Amazon MCS service table. Amazon MCS provides consistent single-digit-millisecond read and write performance at any scale, so you can build applications with low latency to provide a smooth user experience. You have visibility into how your application is performing using Amazon CloudWatch. There is no limit on the size of a table or the number of items, and you do not need to provision storage. Data storage is fully managed and highly available. Your table data is replicated automatically three times across multiple AWS Availability Zones for durability. All customer data is encrypted at rest by default. You can use encryption keys stored in AWS Key Management Service (KMS). Amazon MCS is also integrated with AWS Identity and Access Management (IAM) to help you manage access to your tables and data. Using Amazon Managed Cassandra Service You can use Amazon MCS with the console, CQL, or existing Apache 2.0 licensed Cassandra drivers. In the console there is a CQL editor, or you can connect using cqlsh.  To connect using cqlsh, I need to generate service-specific credentials for an existing IAM user. This is just a command using the AWS Command Line Interface (CLI): aws iam create-service-specific-credential --user-name USERNAME --service-name { "ServiceSpecificCredential": { "CreateDate": "2019-11-27T14:36:16Z", "ServiceName": "", "ServiceUserName": "USERNAME-at-123412341234", "ServicePassword": "...", "ServiceSpecificCredentialId": "...", "UserName": "USERNAME", "Status": "Active" } } Amazon MCS only accepts secure connections using TLS.  I download the Amazon root certificate and edit the cqlshrc configuration file to use it. Now, I can connect with: cqlsh {endpoint} {port} -u {ServiceUserName} -p {ServicePassword} --ssl First, I create a keyspace. A keyspace contains one or more tables and defines the replication strategy for all the tables it contains. With Amazon MCS the default replication strategy for all keyspaces is the Single-region strategy. It replicates data 3 times across multiple Availability Zones in a single AWS Region. To create a keyspace I can use the console or CQL. In the Amazon MCS console, I provide the name for the keyspace. Similarly, I can use CQL to create the bookstore keyspace: CREATE KEYSPACE IF NOT EXISTS bookstore WITH REPLICATION={'class': 'SingleRegionStrategy'}; Now I create a table. A table is where your data is organized and stored. Again, I can use the console or CQL. From the console, I select the bookstore keyspace and give the table a name. Below that, I add the columns for my books table. Each row in a table is referenced by a primary key, that can be composed of one or more columns, the values of which determine which partition the data is stored in. In my case the primary key is the ISBN. Optionally, I can add clustering columns, which determine the sort order of records within a partition. I am not using clustering columns for this table. Alternatively, using CQL, I can create the table with the following commands: USE bookstore; CREATE TABLE IF NOT EXISTS books (isbn text PRIMARY KEY, title text, author text, pages int, year_of_publication int); I now use CQL to insert a record in the books table: INSERT INTO books (isbn, title, author, pages, year_of_publication) VALUES ('978-0201896831', 'The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)', 'Donald E. Knuth', 672, 1997); Let’s run a quick query. In the console, I select the books table and then Query table. In the CQL Editor, I use the default query and select Run command. By default, I see the result of the query in table view: If I prefer, I can see the result in JSON format, similar to what an application using the Cassandra API would see: To insert more records, I use csqlsh again and upload some data from a local CSV file: COPY books (isbn, title, author, pages, year_of_publication) FROM './books.csv' WITH delimiter=',' AND header=TRUE; Now I look again at the content of the books table: SELECT * FROM books; I can select a row using a primary key, or use filtering for additional conditions. For example: SELECT title FROM books WHERE isbn='978-1942788713'; SELECT title FROM books WHERE author='Scott Page' ALLOW FILTERING; With Amazon MCS you can use existing Apache Cassandra 2.0–licensed drivers and developer tools. Open-source Cassandra drivers are available for Java, Python, Ruby, .NET, Node.js, PHP, C++, Perl, and Go. You can learn more in the Amazon MCS documentation. Available in Open Preview Amazon MCS is available today in open preview in US East (N. Virginia), US East (Ohio), Europe (Stockholm), Asia Pacific (Singapore), Asia Pacific (Tokyo). As we work with the Cassandra API libraries, we are contributing bug fixes to the open source Apache Cassandra project. We are also contributing back improvements such as built-in support for AWS authentication (SigV4), which simplifies managing credentials for customers running Cassandra on Amazon Elastic Compute Cloud (EC2), since EC2 and IAM can handle distribution and management of credentials using instance roles automatically. We are also announcing the funding of AWS promotional service credits for testing Cassandra-related open-source projects. To learn more about these contributions, visit the Open Source blog. During the preview, you can use Amazon MCS with on-demand capacity. At general availability, we will also offer the option to use provisioned throughput for more predictable workloads. With on-demand capacity mode, Amazon MCS charges you based on the amount of data your applications read and write from your tables. You do not need to specify how much read and write throughput capacity to provision to your tables because Amazon MCS accommodates your workloads instantly as they scale up or down. As part of the AWS Free Tier, you can get started with Amazon MCS for free. For the first three months, you are offered a monthly free tier of 30 million write request units, 30 million read request units, and 1 GB of storage. Your free tier starts when you create your first Amazon MCS resource. Next year we are making it easier to migrate your data Amazon MCS, adding support to use AWS Database Migration Service. Amazon MCS makes it easy to use Cassandra workloads at any scale, providing a simple programming interface to build new applications, or migrate existing ones. I can’t wait to see what are you going to use it for!

New for Amazon Redshift – Data Lake Export and Federated Query

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze data using standard SQL and existing Business Intelligence (BI) tools. To get information from unstructured data that would not fit in a data warehouse, you can build a data lake. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. With a data lake built on Amazon Simple Storage Service (S3), you can easily run big data analytics and use machine learning to gain insights from your semi-structured (such as JSON, XML) and unstructured datasets. Today, we are launching two new features to help you improve the way you manage your data warehouse and integrate with a data lake: Data Lake Export to unload data from a Redshift cluster to S3 in Apache Parquet format, an efficient open columnar storage format optimized for analytics. Federated Query to be able, from a Redshift cluster, to query across data stored in the cluster, in your S3 data lake, and in one or more Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora PostgreSQL databases. This architectural diagram gives a quick summary of how these features work and how they can be used together with other AWS services. Let’s explain the interactions you see in the diagram better, starting from how you can use these features, and the advantages they provide. Using Redshift Data Lake Export You can now unload the result of a Redshift query to your S3 data lake in Apache Parquet format. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in S3, compared to text formats. This enables you to save data transformation and enrichment you have done in Redshift into your S3 data lake in an open format. You can then analyze the data in your data lake with Redshift Spectrum, a feature of Redshift that allows you to query data directly from files on S3. Or you can use different tools such as Amazon Athena, Amazon EMR, or Amazon SageMaker. To try this new feature, I create a new cluster from the Redshift console, and follow this tutorial to load sample data that keeps track of sales of musical events across different venues. I want to correlate this data with social media comments on the events stored in my data lake. To understand their relevance, each event should have a way of comparing its relative sales to other events. Let’s build a query in Redshift to export the data to S3. My data is stored across multiple tables. I need to create a query that gives me a single view of what is going on with sales. I want to join the content of the  sales and date tables, adding information on the gross sales for an event (total_price in the query), and the percentile in terms of all time gross sales compared to all events. To export the result of the query to S3 in Parquet format, I use the following SQL command: UNLOAD ('SELECT sales.*, date.*, total_price, percentile FROM sales, date, (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile FROM (SELECT eventid, sum(pricepaid) total_price FROM sales GROUP BY eventid)) as percentile_events WHERE sales.dateid = date.dateid AND percentile_events.eventid = sales.eventid') TO 's3://MY-BUCKET/DataLake/Sales/' FORMAT AS PARQUET CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole'; To give Redshift write access to my S3 bucket, I am using an AWS Identity and Access Management (IAM) role. I can see the result of the UNLOAD command using the AWS Command Line Interface (CLI). As expected, the output of the query is exported using the Parquet columnar data format: $ aws s3 ls s3://MY-BUCKET/DataLake/Sales/ 2019-11-25 14:26:56 1638550 0000_part_00.parquet 2019-11-25 14:26:56 1635489 0001_part_00.parquet 2019-11-25 14:26:56 1624418 0002_part_00.parquet 2019-11-25 14:26:56 1646179 0003_part_00.parquet To optimize access to data, I can specify one or more partition columns so that unloaded data is automatically partitioned into folders in my S3 bucket. For example, I can unload sales data partitioned by year, month, and day. This enables my queries to take advantage of partition pruning and skip scanning irrelevant partitions, improving query performance and minimizing cost. To use partitioning, I need to add to the previous SQL command the PARTITION BY option, followed by the columns I want to use to partition the data in different directories. In my case, I want to partition the output based on the year and the calendar date (caldate in the query) of the sales. UNLOAD ('SELECT sales.*, date.*, total_price, percentile FROM sales, date, (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile FROM (SELECT eventid, sum(pricepaid) total_price FROM sales GROUP BY eventid)) as percentile_events WHERE sales.dateid = date.dateid AND percentile_events.eventid = sales.eventid') TO 's3://MY-BUCKET/DataLake/SalesPartitioned/' FORMAT AS PARQUET PARTITION BY (year, caldate) CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole'; This time, the output of the query is stored in multiple partitions. For example, here’s the content of a folder for a specific year and date: $ aws s3 ls s3://MY-BUCKET/DataLake/SalesPartitioned/year=2008/caldate=2008-07-20/ 2019-11-25 14:36:17 11940 0000_part_00.parquet 2019-11-25 14:36:17 11052 0001_part_00.parquet 2019-11-25 14:36:17 11138 0002_part_00.parquet 2019-11-25 14:36:18 12582 0003_part_00.parquet Optionally, I can use AWS Glue to set up a Crawler that (on demand or on a schedule) looks for data in my S3 bucket to update the Glue Data Catalog. When the Data Catalog is updated, I can easily query the data using Redshift Spectrum, Athena, or EMR. The sales data is now ready to be processed together with the unstructured and semi-structured  (JSON, XML, Parquet) data in my data lake. For example, I can now use Apache Spark with EMR, or any Sagemaker built-in algorithm to access the data and get new insights. Using Redshift Federated Query You can now also access data in RDS and Aurora PostgreSQL stores directly from your Redshift data warehouse. In this way, you can access data as soon as it is available. Straight from Redshift, you can now perform queries processing data in your data warehouse, transactional databases, and data lake, without requiring ETL jobs to transfer data to the data warehouse. Redshift leverages its advanced optimization capabilities to push down and distribute a significant portion of the computation directly into the transactional databases, minimizing the amount of data moving over the network. Using this syntax, you can add an external schema from an RDS or Aurora PostgreSQL database to a Redshift cluster: CREATE EXTERNAL SCHEMA IF NOT EXISTS online_system FROM POSTGRES DATABASE 'online_sales_db' SCHEMA 'online_system' URI ‘my-hostname' port 5432 IAM_ROLE 'iam-role-arn' SECRET_ARN 'ssm-secret-arn'; Schema and port are optional here. Schema will default to public if left unspecified and default port for PostgreSQL databases is 5432. Redshift is using AWS Secrets Manager to manage the credentials to connect to the external databases. With this command, all tables in the external schema are available and can be used by Redshift for any complex SQL query processing data in the cluster or, using Redshift Spectrum, in your S3 data lake. Coming back to the sales data example I used before, I can now correlate the trends of my historical data of musical events with real-time sales. In this way, I can understand if an event is performing as expected or not, and calibrate my marketing activities without delays. For example, after I define the online commerce database as the online_system external schema in my Redshift cluster, I can compare previous sales with what is in the online commerce system with this simple query: SELECT eventid, sum(pricepaid) total_price, sum(online_pricepaid) online_total_price FROM sales, online_system.current_sales GROUP BY eventid WHERE eventid = online_eventid; Redshift doesn’t import database or schema catalog in its entirety. When a query is run, it localizes the metadata for the Aurora and RDS tables (and views) that are part of the query. This localized metadata is then used for query compilation and plan generation. Available Now Amazon Redshift data lake export is a new tool to improve your data processing pipeline and is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Redshift availability, and check the version of your clusters. The new federation capability in Amazon Redshift is released as a public preview and allows you to bring together data stored in Redshift, S3, and one or more RDS and Aurora PostgreSQL databases. When creating a cluster in the Amazon Redshift management console, you can pick three tracks for maintenance: Current, Trailing, or Preview. Within the Preview track, preview_features should be chosen to participate to the Federated Query public preview. For example: These features simplify data processing and analytics, giving you more tools to react quickly, and a single point of view for your data. Let me know what you are going to use them for! — Danilo

Announcing UltraWarm (Preview) for Amazon Elasticsearch Service

Today, we are excited to announce UltraWarm, a fully managed, low-cost, warm storage tier for Amazon Elasticsearch Service. UltraWarm is now available in preview and takes a new approach to providing hot-warm tiering in Amazon Elasticsearch Service, offering up to 900TB of storage, at almost a 90% cost reduction over existing options. UltraWarm is a seamless extension to the Amazon Elasticsearch Service experience, enabling you to query and visualize across both hot and UltraWarm data, all from your familiar Kibana interface. UltraWarm data can be queried using the same APIs and tools you use today, and also supports popular Amazon Elasticsearch Service features like encryption at rest and in flight, integrated alerting, SQL querying, and more. A popular use case for our customers of Amazon Elasticsearch Service is to ingest and analyze high (and increasingly growing) volumes of machine-generated log data. However, those customers tell us that they want to perform real-time analysis on more of this data, so they can use it to help quickly resolve operational and security issues. Storage and analysis of months, or even years, of data has been cost prohibitive for them at scale, causing some to turn to use multiple analytics tools, while others simply delete valuable data, missing out on insights. UltraWarm, with its cost-effective storage backed by Amazon Simple Storage Service (S3), helps solve this problem, enabling customers to retain years of data for analysis. With the launch of UltraWarm, Amazon Elasticsearch Service supports two storage tiers, hot and UltraWarm. The hot tier is used for indexing, updating, and providing the fastest access to data. UltraWarm complements the hot tier to add support for high volumes of older, less-frequently accessed, data to enable you to take advantage of a lower storage cost. As I mentioned earlier, UltraWarm stores data in S3 and uses custom, highly-optimized nodes, built on the AWS Nitro System, to cache, pre-fetch, and query that data. This all contributes to providing an interactive experience when querying and visualizing data. The UltraWarm preview is now available to all customers in the US East (N. Virginia, Ohio) and US West (Oregon) Regions. The UltraWarm tier is available with a pay-as-you-go pricing model, charging for the instance hours for your node, and utilized storage. The UltraWarm preview can be enabled on new Amazon Elasticsearch Service version 6.8 domains. To learn more, visit the technical documentation. — Steve  

Amazon Redshift Update – Next-Generation Compute Instances and Managed, Analytics-Optimized Storage

We launched Amazon Redshift back in 2012 (Amazon Redshift – The New AWS Data Warehouse). With tens of thousands of customers, it is now the world’s most popular data warehouse. Our customers enjoy consistently fast performance, support for complex queries, and transactional capabilities, all with industry-leading price-performance. The original Redshift model establishes a fairly rigid coupling between compute power and storage capacity. You create a cluster with a specific number of instances, and are committed to (and occasionally limited by) the amount of local storage that is provided with each instance. You can access additional compute power with on-demand Concurrency Scaling, and you can use Elastic Resize to scale your clusters up and down in minutes, giving you the ability to adapt to changing compute and storage needs. We think we can do even better! Today we are launching the next generation of Nitro-powered compute instances for Redshift, backed by a new managed storage model that gives you the power to separately optimize your compute power and your storage. This launch takes advantage of some architectural improvements including high-bandwidth networking, managed storage that uses local SSD-based storage backed by Amazon Simple Storage Service (S3), and multiple, advanced data management techniques to optimize data motion to and from S3. Together, these capabilities allow Redshift to deliver 3x the performance of any other cloud data warehouse service, and most existing Amazon Redshift customers using Dense Storage (DS2) instances will get up to 2x better performance and 2x more storage at the same cost. Among many other use cases, this new combo is a great fit for operational analytics, where much of the workload is focused on a small (and often recent) subset of the data in the data warehouse. In the past, customers would unload older data to other types of storage in order to stay within storage limits, leading to additional complexity and making queries on historical data very complex. Next-Generation Compute Instances The new RA3 instances are designed to work hand-in-glove with the new managed storage model. The ra3.16xlarge instances have 48 vCPUs, 384 GiB of Memory, and up to 64 TB of storage. I can create clusters with 2 to 128 instances, giving me over 8 PB of compressed storage: I can also create a new RA3-powered cluster from a snapshot of an existing cluster, or I can use Classic resize to upgrade my cluster to use the new instance type. If you have an existing snapshot or a cluster, you can use the Amazon Redshift console to get a recommended RA3 configuration when you restore or resize. You can also get recommendations from the DescribeNodeConfigurationOptions function or the describe-node-configuration-options command. Managed, Analytics-Optimized Storage The new managed storage is equally exciting. There’s a cache of large-capacity, high-performance SSD-based storage on each instance, backed by S3, for scale, performance, and durability. The storage system uses multiple cues, including data block temperature, data blockage, and workload patterns, to manage the cache for high performance. Data is automatically placed into the appropriate tier, and you need not do anything special to benefit from the caching or the other optimizations. You pay the same low price for SSD and S3 storage, and you can scale the storage capacity of your data warehouse without adding and paying for additional instances. Price & Availability You can start using RA3 instances together with managed storage in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), EU (Frankfurt), EU (Ireland), EU (London). — Jeff;  

Easily Manage Shared Data Sets with Amazon S3 Access Points

Storage that is secure, scalable, durable, and highly available is a fundamental component of cloud computing. That’s why Amazon Simple Storage Service (S3) was the first service launched by AWS, back in 2006. It has been a building block of many of the more than 175 services that AWS now offers. As we approach the beginning of a new decade, capabilities like Amazon Redshift, Amazon Athena, Amazon EMR and AWS Lake Formation have made S3 not just a way to store objects but an engine for turning that data into insights. These capabilities mean that access patterns and requirements for the data stored in buckets have evolved. Today we’re launching a new way to manage data access at scale for shared data sets in S3: Amazon S3 Access Points. S3 Access Points are unique hostnames with dedicated access policies that describe how data can be accessed using that endpoint. Before S3 Access Points, shared access to data meant managing a single policy document on a bucket. These policies could represent hundreds of applications with many differing permissions, making audits, and updates a potential bottleneck affecting many systems. With S3 Access Points, you can add access points as you add additional applications or teams, keeping your policies specific and easier to manage. A bucket can have multiple access points, and each access point has its own AWS Identity and Access Management (IAM) policy. Access point policies are similar to bucket policies, but associated with the access point. S3 Access Points can also be restricted to only allow access from within a Amazon Virtual Private Cloud. And because each access point has a unique DNS name, you can now address your buckets with any name that is unique within your AWS account and region. Creating S3 Access Points Let’s add an access point to a bucket using the S3 Console. You can also create and manage your S3 Access Points using the AWS Command Line Interface (CLI), AWS SDKs, or via the API. I’ve selected a bucket that contains artifacts generated by a AWS Lambda function, and clicked on the access points tab. Let’s create a new access point. I want to give an IAM user Alice permission to GET and PUT objects with the prefix Alice. I’m going to name this access point alices-access-point. There are options for restricting access to a Virtual Private Cloud, which just requires a Virtual Private Cloud ID. In this, I want to allow access from outside the VPC as well, so after I took this screenshot, I selected Internet and moved onto the next step. S3 Access Points makes it easy to block public access. I’m going to block all public access to this access point. And now I can attach my policy. In this policy, our Principal is our user Alice, and the resource is our access point combined with every object with the prefix /Alice. For more examples of the kinds of policies you might want to attach to your S3 Access Points, take a look at the docs. After I create the access point, I can access it by hostname using the format https://[access_point_name]-[accountID].s3-accesspoint.[region] Via the SDKs and CLI, I can use it the same way I would use a bucket once I’ve updated to the latest version. For example, assuming I were authenticated as Alice, I could do the following: $ aws s3api get-object --key /Alice/ --bucket arn:aws:s3:us-east-1:[my-account-id]:alices-access-point Access points that are not restricted to VPCs can also be used via the S3 Console. Things to Know S3 Access Points is available now in all AWS Regions, at no cost. By default each account can create 1,000 access points per region. You can use S3 Access Points with AWS CloudFormation. If you use AWS Organizations, you can add a Service Control Policy (SCP) requiring all access points are restricted to a VPC. When it comes to software design, keeping scopes small and focused on a specific task is almost always a good decision. With S3 Access Points, you can customize hostnames and permissions for any user or application that needs access to your shared data set. Let us know how you like this new capability, and happy building! — Brandon

Amazon EKS on AWS Fargate Now Generally Available

Starting today, you can start using Amazon Elastic Kubernetes Service to run Kubernetes pods on AWS Fargate. EKS and Fargate make it straightforward to run Kubernetes-based applications on AWS by removing the need to provision and manage infrastructure for pods. With AWS Fargate, customers don’t need to be experts in Kubernetes operations to run a cost-optimized and highly-available cluster. Fargate eliminates the need for customers to create or manage EC2 instances for their Amazon EKS clusters. Customers no longer have to worry about patching, scaling, or securing a cluster of EC2 instances to run Kubernetes applications in the cloud. Using Fargate, customers define and pay for resources at the pod-level. This makes it easy to right-size resource utilization for each application and allow customers to clearly see the cost of each pod. I’m now going to use the rest of this blog to explore this new feature further and deploy a simple Kubernetes-based application using Amazon EKS on Fargate. Let’s Build a Cluster The simplest way to get a cluster set up is to use eksctl, the official CLI tool for EKS. The command below creates a cluster called demo-newsblog with no worker nodes. eksctl create cluster --name demo-newsblog --region eu-west-1 --fargate This single command did quite a lot under the hood. Not only did it create a cluster for me, amongst other things, it also created a Fargate profile. A Fargate profile, lets me specify which Kubernetes pods I want to run on Fargate, which subnets my pods run in, and provides the IAM execution role used by the Kubernetes agent to download container images to the pod and perform other actions on my behalf. Understanding Fargate profiles is key to understanding how this feature works. So I am going to delete the Fargate profile that was automatically created for me and recreate it manually. To create a Fargate profile, I head over to the Amazon Elastic Kubernetes Service console and choose the cluster demo-newsblog. On the details, Under Fargate profiles, I choose Add Fargate profile. I then need to configure my new Fargate profile. For the name, I enter demo-default. In the Pod execution role, only IAM roles with the service principal are shown. The eksctl tool creates an IAM role called AmazonEKSFargatePodExecutionRole, the documentation shows how this role can be created from scratch. In the Subnets section, by default, all subnets in my cluster’s VPC are selected. However, only private subnets are supported for Fargate pods, so I deselect the two public subnets. When I click next, I am taken to the Pod selectors screen. Here it asks me to enter a namespace. I add default, meaning that I want any pods that are created in the default Kubernetes namespace to run on Fargate. It’s important to understand that I don’t have to modify my Kubernetes app to get the pods running on Fargate, I just need a Fargate Profile – if a pod in my Kubernetes app matches the namespace defined in my profile, that pod will run on Fargate. There is also a Match labels feature here, which I am not using. This allows you to specify the labels of the pods that you want to select, so you can get even more specific with which pods run on this profile. Finally, I click Next and then Create. It takes a minute for the profile to create and become active. In this demo, I also want everything to run on Fargate, including the CoreDNS pods that are part of Kubernetes. To get them running on Fargate, I will add a second Fargate profile for everything in the kube-system namespace. This time, to add a bit of variety to the demo, I will use the command line to create my profile. Technically, I do not need to create a second profile for this. I could have added an additional namespace to the first profile, but this way, I get to explore an alternative way of creating a profile. First, I create the file below and save it as demo-kube-system-profile.json. { "fargateProfileName": "demo-kube-system", "clusterName": "demo-news-blog", "podExecutionRoleArn": "arn:aws:iam::xxx:role/AmazonEKSFargatePodExecutionRole", "subnets": [ "subnet-0968a124a4e4b0afe", "subnet-0723bbe802a360eb9" ], "selectors": [ { "namespace": "kube-system" } ] } I then navigate to the folder that contains the file above and run the create-fargate-profile command in my terminal. aws eks create-fargate-profile --cli-input-json file://demo-kube-system-profile.json I am now ready to deploy a container to my cluster. To keep things simple, I deploy a single instance of nginx using the following kubectl command. kubectl create deployment demo-app --image=nginx I then check to see the state of my pods by running the get pods command. kubectl get pods NAME READY STATUS RESTARTS AGE demo-app-6dbfc49497-67dxk 0/1 Pending 0 13s If I run get nodes  I have three nodes (two for coreDNS and one for nginx). These nodes represent the compute resources that have instantiated for me to run my pods. kubectl get nodes NAME STATUS ROLES AGE VERSION Ready <none> 4m45s v1.14.8-eks Ready <none> 2m20s v1.14.8-eks Ready <none> 4m40s v1.14.8-eks After a short time, I rerun the get pods command, and my demo-app now has a status of Running. Meaning my container has been successfully deployed onto Fargate. kubectl get pods NAME READY STATUS RESTARTS AGE demo-app-6dbfc49497-67dxk 1/1 Running 0 3m52s Pricing and Limitations With AWS Fargate, you pay only for the amount of vCPU and memory resources that your pod needs to run. This includes the resources the pod requests in addition to a small amount of memory needed to run Kubernetes components alongside the pod. Pods running on Fargate follow the existing pricing model. vCPU and memory resources are calculated from the time your pod’s container images are pulled until the pod terminates, rounded up to the nearest second. A minimum charge for 1 minute applies. Additionally, you pay the standard cost for each EKS cluster you run, $0.20 per hour. There are currently a few limitations that you should be aware of: There is a maximum of 4 vCPU and 30Gb memory per pod. Currently there is no support for stateful workloads that require persistent volumes or file systems. You cannot run Daemonsets, Privileged pods, or pods that use HostNetwork or HostPort. The only load balancer you can use is an Application Load Balancer. Get Started Today If you want to explore Amazon EKS on AWS Fargate yourself, you can try it now by heading on over to the EKS console in the following regions: US East (N. Virginia), US East (Ohio), Europe (Ireland), and Asia Pacific (Tokyo). — Martin

Identify Unintended Resource Access with AWS Identity and Access Management (IAM) Access Analyzer

Today I get to share my favorite kind of announcement. It’s the sort of thing that will improve security for just about everyone that builds on AWS, it can be turned on with almost no configuration, and it costs nothing to use. We’re launching a new, first-of-its-kind capability called AWS Identity and Access Management (IAM) Access Analyzer. IAM Access Analyzer mathematically analyzes access control policies attached to resources and determines which resources can be accessed publicly or from other accounts. It continuously monitors all policies for Amazon Simple Storage Service (S3) buckets, IAM roles, AWS Key Management Service (KMS) keys, AWS Lambda functions, and Amazon Simple Queue Service (SQS) queues. With IAM Access Analyzer, you have visibility into the aggregate impact of your access controls, so you can be confident your resources are protected from unintended access from outside of your account. Let’s look at a couple examples. An IAM Access Analyzer finding might indicate an S3 bucket named my-bucket-1 is accessible to an AWS account with the id 123456789012 when originating from the source IP Or IAM Access Analyzer may detect a KMS key policy that allow users from another account to delete the key, identifying a data loss risk you can fix by adjusting the policy. If the findings show intentional access paths, they can be archived. So how does it work? Using the kind of math that shows up on unexpected final exams in my nightmares, IAM Access Analyzer evaluates your policies to determine how a given resource can be accessed. Critically, this analysis is not based on historical events or pattern matching or brute force tests. Instead, IAM Access Analyzer understands your policies semantically. All possible access paths are verified by mathematical proofs, and thousands of policies can be analyzed in a few seconds. This is done using a type of cognitive science called automated reasoning. IAM Access Analyzer is the first service powered by automated reasoning available to builders everywhere, offering functionality unique to AWS. To start learning about automated reasoning, I highly recommend this short video explainer. If you are interested in diving a bit deeper, check out this re:Invent talk on automated reasoning from Byron Cook, Director of the AWS Automated Reasoning Group. And if you’re really interested in understanding the methodology, make yourself a nice cup of chamomile tea, grab a blanket, and get cozy with a copy of Semantic-based Automated Reasoning for AWS Access Policies using SMT. Turning on IAM Access Analyzer is way less stressful than an unexpected nightmare final exam. There’s just one step. From the IAM Console, select Access analyzer from the menu on the left, then click Create analyzer. Analyzers generate findings in the account from which they are created. Analyzers also work within the region defined when they are created, so create one in each region for which you’d like to see findings. Once our analyzer is created, findings that show accessible resources appear in the Console. My account has a few findings that are worth looking into, such as KMS keys and IAM roles that are accessible by other accounts and federated users. I’m going to click on the first finding and take a look at the access policy for this KMS key. From here we can see the open access paths and details about the resources and principals involved. I went over to the KMS console and confirmed that this is intended access, so I archived this particular finding. All IAM Access Analyzer findings are visible in the IAM Console, and can also be accessed using the IAM Access Analyzer API. Findings related to S3 buckets can be viewed directly in the S3 Console. Bucket policies can then be updated right in the S3 Console, closing the open access pathway. You can also see high-priority findings generated by IAM Access Analyzer in AWS Security Hub, ensuring a comprehensive, single source of truth for your compliance and security-focused team members. IAM Access Analyzer also integrates with CloudWatch Events, making it easy to automatically respond to or send alerts regarding findings through the use of custom rules. Now that you’ve seen how IAM Access Analyzer provides a comprehensive overview of cloud resource access, you should probably head over to IAM and turn it on. One of the great advantages of building in the cloud is that the infrastructure and tools continue to get stronger over time and IAM Access Analyzer is a great example. Did I mention that it’s free? Fire it up, then send me a tweet sharing some of the interesting things you find. As always, happy building! — Brandon


Recommended Content