Industry Buzz

AWS Outposts Now Available – Order Yours Today!

Amazon Web Services Blog -

We first discussed AWS Outposts at re:Invent 2018. Today, I am happy to announce that we are ready to take orders and install Outposts racks in your data center or colo facility. Why Outposts? This new and unique AWS offering is a comprehensive, single-vendor compute & storage solution that is designed to meet the needs of customers who need local processing and very low latency. You no longer need to spend time creating detailed hardware specifications, soliciting & managing bids from multiple disparate vendors, or racking & stacking individual servers. Instead, you place your order online, take delivery, and relax while trained AWS technicians install, connect, set up, and verify your Outposts. Once installed, we take care of monitoring, maintaining, and upgrading your Outposts. All of the hardware is modular and can be replaced in the field without downtime. When you need more processing or storage, or want to upgrade to newer generations of EC2 instances, you can initiate the request with a couple of clicks and we will take care of the rest.

Everything that you and your team already know about AWS still applies. You use the same APIs, tools, and operational practices. You can create a single deployment pipeline that targets your Outposts and your cloud-based environments, and you can create hybrid architectures that span both. Each Outpost is connected to and controlled by a specific AWS Region. The region treats a collection of up to 16 racks at a single location as a unified capacity pool. The collection can be associated with subnets of one or more VPCs in the parent region.

Outposts Hardware
The Outposts hardware is the same as what we use in our own data centers, with some additional security devices. The hardware is designed for reliability & efficiency, with redundant network switches and power supplies, and DC power distribution. Outpost racks are 80″ tall, 24″ wide, 48″ deep, and can weigh up to 2000 lbs. They arrive fully assembled, and roll in on casters, ready for connection to power and networking. To learn more about the Outposts hardware, watch my colleague Anthony Liguori explain it in the video in the original post.

Outposts supports multiple Intel®-powered Nitro-based EC2 instance types including C5, C5d, M5, M5d, R5, R5d, G4, and I3en. You can choose the mix of types that is right for your environment, and you can add more later. You will also be able to upgrade to newer instance types as they become available. On the storage side, Outposts supports EBS gp2 (general purpose SSD) storage, with a minimum size of 2.7 TB.

Outpost Networking
Each Outpost has a pair of networking devices, each with 400 Gbps of connectivity and support for 1 GigE, 10 GigE, 40 GigE, and 100 Gigabit fiber connections. The connections are used to host a pair of Link Aggregation Groups, one for the link to the parent region, and another to your local network. The link to the parent region is used for control and VPC traffic; all connections originate from the Outpost. Traffic to and from your local network flows through a Local Gateway (LGW), giving you full control over access and routing. (The original post includes a diagram of the networking topology within your premises.) You will need to allocate a /26 CIDR block to each Outpost, which is advertised as a pair of /27 blocks in order to protect against device and link failures (a short snippet below makes this concrete). The CIDR block can be within your own range of public IP addresses, or it can be an RFC 1918 private address range plus NAT at your network edge. Outposts are simply new subnets on an existing VPC in the parent region.
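To make the /26-advertised-as-two-/27s point concrete, here is a tiny illustrative snippet using only the Python standard library; the address range is a documentation placeholder, not a recommendation.

import ipaddress

# A /26 allocated to the Outpost (203.0.113.0/26 is a placeholder documentation range).
outpost_block = ipaddress.ip_network("203.0.113.0/26")

# The block is advertised toward the parent region as two /27s for resiliency.
advertised = list(outpost_block.subnets(new_prefix=27))
print(advertised)   # [IPv4Network('203.0.113.0/27'), IPv4Network('203.0.113.32/27')]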
Here’s how to create an Outpost subnet with the AWS CLI:

$ aws ec2 create-subnet --vpc-id VVVVVV \
    --cidr-block A.B.C.D/24 \
    --outpost-arn arn:aws:outposts:REGION:ACCOUNT_ID:outpost:OUTPOST_ID

If you have Cisco or Juniper hardware in your data center, the following guides will be helpful:

Cisco – Outposts Solution Overview. To learn more about the partnership between AWS and Cisco, visit this page.
Juniper – AWS Outposts in a Juniper QFX-Based Datacenter.

In most cases you will want to use AWS Direct Connect to establish a connection between your Outposts and the parent AWS Region. For more information on this, and to learn a lot more about how to plan your Outposts network model, consult the How it Works documentation.

Outpost Services
We are launching with support for Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), Amazon Virtual Private Cloud, Amazon ECS, Amazon Elastic Kubernetes Service, and Amazon EMR, with additional services in the works. Amazon RDS for PostgreSQL and Amazon RDS for MySQL are available in preview form. Your applications can also make use of any desired services in the parent region, including Amazon Simple Storage Service (S3), Amazon DynamoDB, Auto Scaling, AWS CloudFormation, Amazon CloudWatch, AWS CloudTrail, AWS Config, Load Balancing, and so forth. You can create and use Interface Endpoints from within the VPC, or you can access the services through the regional public endpoints. Services & applications in the parent region that launch, manage, or refer to EC2 instances or EBS volumes can operate on those objects within an Outpost with no changes.

Purchasing an Outpost
The process of purchasing an Outpost is a bit more involved than that of launching an EC2 instance or creating an S3 bucket, but it should be straightforward. I don’t actually have a data center, and won’t actually take delivery of an Outpost, but I’ll do my best to show you the actual experience! The first step is to describe and qualify my site. I enter my address, confirm temperature, humidity, and airflow at the rack position, verify that my loading dock can accommodate the shipping crate, and confirm that there’s a clear access path from the loading dock to the rack’s final resting position. I then provide information about my site’s power and networking configuration. After I create the site, I create my Outpost.

Now I am ready to order my hardware. I can choose any one of 18 standard configurations, with varied amounts of compute capacity and storage (custom configurations are also available), and click Create order to proceed. The EC2 capacity shown in the console indicates the largest instance size of a particular type. I can launch instances of that size, or I can use the smaller sizes, as needed. For example, the capacity of the OR-HUZEI16 configuration that I selected is listed as 7 m5.24xlarge instances and 3 c5.24xlarge instances. I could launch a total of 10 instances in those sizes, or (if I needed lots of smaller ones) I could launch 168 m5.xlarge instances and 72 c5.xlarge instances. I could also use a variety of sizes, subject to available capacity and the details of how the instances are assigned to the hardware. I confirm my order, choose the Outpost that I created earlier, and click Submit order. My order will be reviewed, my colleagues might give me a call to review some details, and my Outpost will be shipped to my site.
A team of AWS installers will arrive to unpack & inspect the Outpost, transport it to its resting position in my data center, and work with my data center operations (DCO) team to get it connected and powered up. Once the Outpost is powered up and the network is configured, it will set itself up automatically. At that point I can return to the console and monitor capacity exceptions (situations where demand exceeds supply), capacity availability, and capacity utilization.

Using an Outpost
The next step is to set up one or more subnets in my Outpost, as shown above. Then I can launch EC2 instances and create EBS volumes in the subnet, just as I would with any other VPC subnet (a hedged boto3 sketch of these steps follows at the end of this post). I can ask for more capacity by selecting Increase capacity from the Actions menu; the AWS team will contact me within 3 business days to discuss my options.

Things to Know
Here are a few other things to keep in mind when thinking about using Outposts:

Availability – Outposts are available in the following countries: North America (United States); Europe (all EU countries, Switzerland, and Norway); Asia Pacific (Japan, South Korea, and Australia).

Support – You must subscribe to AWS Enterprise Support in order to purchase an Outpost. We will remotely monitor your Outpost, and keep it happy & healthy over time. We’ll look for failing components and arrange to replace them without disturbing your operations.

Billing & Payment Options – You can purchase Outposts on a three-year term, with All Upfront, Partial Upfront, and No Upfront payment options. The purchase price covers all EC2 and EBS usage within the Outpost; other services are billed by the hour, with the EC2 and EBS portions removed. You pay the regular inter-AZ data transfer charge to move data between an Outpost and another subnet in the same VPC, and the usual AWS data transfer charge for data that exits to the Internet across the link to the parent region.

Capacity Expansion – Today, you can group up to 16 racks into a single capacity pool. Over time we expect to allow you to group thousands of racks together in this manner.

Stay Tuned
This is, like most AWS announcements, just the starting point. We have a lot of cool stuff in the works, and it is still Day One for AWS Outposts! — Jeff;
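To complement the CLI example earlier in this post, here is a minimal boto3 sketch of the “Using an Outpost” steps: creating a subnet tied to an Outpost, then launching an instance and creating a volume in it. The VPC ID, Outpost ARN, AMI ID, and CIDR are placeholders, and the parameter sets shown are simplified assumptions rather than complete configurations.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Placeholder identifiers -- substitute your own VPC, Outpost ARN, and AMI.
vpc_id = "vpc-00000000000000000"
outpost_arn = "arn:aws:outposts:us-west-2:123456789012:outpost/op-0000000000000000"
ami_id = "ami-00000000000000000"

# Create a subnet that lives on the Outpost (same call as the CLI example above).
subnet = ec2.create_subnet(
    VpcId=vpc_id,
    CidrBlock="10.0.3.0/24",
    OutpostArn=outpost_arn,
)["Subnet"]

# Launch an instance into the Outpost subnet, just like any other VPC subnet.
ec2.run_instances(
    ImageId=ami_id,
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId=subnet["SubnetId"],
)

# Create an EBS gp2 volume on the Outpost.
ec2.create_volume(
    AvailabilityZone=subnet["AvailabilityZone"],
    Size=100,
    VolumeType="gp2",
    OutpostArn=outpost_arn,
)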

AWS Now Available from a Local Zone in Los Angeles

Amazon Web Services Blog -

AWS customers are always asking for more features, more bandwidth, more compute power, and more memory, while also asking for lower latency and lower prices. We do our best to meet these competing demands: we launch new EC2 instance types, EBS volume types, and S3 storage classes at a rapid pace, and we also reduce prices regularly.

AWS in Los Angeles
Today we are launching a Local Zone in Los Angeles, California. The Local Zone is a new type of AWS infrastructure deployment that brings select AWS services very close to a particular geographic area. This Local Zone is designed to provide very low latency (single-digit milliseconds) to applications that are accessed from Los Angeles and other locations in Southern California. It will be of particular interest to demanding, latency-sensitive applications, including:

Media & Entertainment – Gaming, 3D modeling & rendering, video processing (including real-time color correction), video streaming, and media production pipelines.
Electronic Design Automation – Interactive design & layout, simulation, and verification.
Ad-Tech – Rapid decision making & ad serving.
Machine Learning – Fast, continuous model training; high-performance low-latency inferencing.

All About Local Zones
The new Local Zone in Los Angeles is a logical part of the US West (Oregon) Region (which I will refer to as the parent region), and has some unique and interesting characteristics:

Naming – The Local Zone can be accessed programmatically as us-west-2-lax-1a. All API, CLI, and Console access takes place through the us-west-2 API endpoint and the US West (Oregon) Console.

Opt-In – You will need to opt in to the Local Zone in order to use it. After opting in, you can create a new VPC subnet in the Local Zone, taking advantage of all relevant VPC features including Security Groups, Network ACLs, and Route Tables. You can target the Local Zone when you launch EC2 instances and other resources, or you can create a default subnet in the VPC and have it happen automatically.

Networking – The Local Zone in Los Angeles is connected to US West (Oregon) over Amazon’s private backbone network. Connections to the public internet take place across an Internet Gateway, giving you local ingress and egress to reduce latency. Elastic IP Addresses can be shared by a group of Local Zones in a particular geographic location, but they do not move between a Local Zone and the parent region. The Local Zone also supports AWS Direct Connect, giving you the opportunity to route your traffic over a private network connection.

Services – We are launching with support for seven EC2 instance types (T3, C5, M5, R5, R5d, I3en, and G4), two EBS volume types (io1 and gp2), Amazon FSx for Windows File Server, Amazon FSx for Lustre, Application Load Balancer, and Amazon Virtual Private Cloud. Single-Zone RDS is on the near-term roadmap, and other services will come later based on customer demand. Applications running in a Local Zone can also make use of services in the parent region.

Parent Region – As I mentioned earlier, the new Local Zone is a logical extension of the US West (Oregon) region, and is managed by the “control plane” in the region. API calls, CLI commands, and the AWS Management Console should use “us-west-2” or US West (Oregon).

AWS – Other parts of AWS will continue to work as expected after you start to use this Local Zone.
Your IAM resources, CloudFormation templates, and Organizations are still relevant and applicable, as are your tools and (perhaps most important) your investment in AWS training.

Pricing & Billing – Instances and other AWS resources in Local Zones will have different prices than in the parent region. Billing reports will include a prefix that is specific to a group of Local Zones that share a physical location. EC2 instances are available in On Demand & Spot form, and you can also purchase Savings Plans.

Using a Local Zone
The first Local Zone is available today, and you can request access via the signup link in the original post. In early 2020, you will be able to opt in using the console, CLI, or API. After opting in, I can list my AZs and see that the Local Zone is included. Then I create a new VPC subnet for the Local Zone. This gives me transparent, seamless connectivity between the parent region in Oregon and the Local Zone in Los Angeles, all within the VPC. I can create EBS volumes, and they are, as usual, ready within seconds. I can also see and use the Local Zone from within the AWS Management Console, and I can use the AWS APIs, CloudFormation templates, and so forth (a hedged boto3 sketch of the list-and-create-subnet steps follows at the end of this post).

Thinking Ahead
Local Zones give you even more architectural flexibility. You can think big, and you can think different! You now have the components, tools, and services at your fingertips to build applications that make use of any conceivable combination of legacy on-premises resources, modern on-premises cloud resources via AWS Outposts, resources in a Local Zone, and resources in one or more AWS regions. In the fullness of time (as Andy Jassy often says), there could very well be more than one Local Zone in any given geographic area. In 2020, we will open a second one in Los Angeles (us-west-2-lax-1b), and we are giving consideration to other locations. We would love to get your advice on locations, so feel free to leave me a comment or two!

Now Available
The Local Zone in Los Angeles is available now and you can start using it today. Learn more about Local Zones. — Jeff;
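Here is a minimal boto3 sketch of the list-and-create-subnet steps described above, assuming the account has already been opted in to the Local Zone; the VPC ID and CIDR are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# List all zones, including Local Zones, and look for us-west-2-lax-1a.
zones = ec2.describe_availability_zones(AllAvailabilityZones=True)["AvailabilityZones"]
for zone in zones:
    print(zone["ZoneName"], zone.get("ZoneType"), zone["OptInStatus"])

# Create a subnet in the Local Zone inside an existing VPC (placeholder IDs).
subnet = ec2.create_subnet(
    VpcId="vpc-00000000000000000",
    CidrBlock="10.0.77.0/24",
    AvailabilityZone="us-west-2-lax-1a",
)["Subnet"]
print("Created", subnet["SubnetId"], "in", subnet["AvailabilityZone"])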

Amazon SageMaker Studio: The First Fully Integrated Development Environment For Machine Learning

Amazon Web Services Blog -

Today, we’re extremely happy to launch Amazon SageMaker Studio, the first fully integrated development environment (IDE) for machine learning (ML). We have come a long way since we launched Amazon SageMaker in 2017, as shown by the growing number of customers using the service. However, the ML development workflow is still very iterative, and is challenging for developers to manage due to the relative immaturity of ML tooling. Many of the tools that developers take for granted when building traditional software (debuggers, project management, collaboration, monitoring, and so forth) have yet to be invented for ML. For example, when trying a new algorithm or tweaking hyperparameters, developers and data scientists typically run hundreds or thousands of experiments on Amazon SageMaker, and they need to manage all of this manually. Over time, it becomes much harder to track the best performing models, and to capitalize on lessons learned during the course of experimentation.

Amazon SageMaker Studio at last unifies all the tools needed for ML development. Developers can write code, track experiments, visualize data, and perform debugging and monitoring all within a single, integrated visual interface, which significantly boosts developer productivity. In addition, since all these steps of the ML workflow are tracked within the environment, developers can quickly move back and forth between steps, and also clone, tweak, and replay them. This gives developers the ability to make changes quickly, observe outcomes, and iterate faster, reducing the time to market for high quality ML solutions.

Introducing Amazon SageMaker Studio
Amazon SageMaker Studio lets you manage your entire ML workflow through a single pane of glass. Let me give you the whirlwind tour!

With Amazon SageMaker Notebooks (currently in preview), you can enjoy an enhanced notebook experience that lets you easily create and share Jupyter notebooks. Without having to manage any infrastructure, you can also quickly switch from one hardware configuration to another.

With Amazon SageMaker Experiments, you can organize, track, and compare thousands of ML jobs: these can be training jobs, or data processing and model evaluation jobs run with Amazon SageMaker Processing.

With Amazon SageMaker Debugger, you can debug and analyze complex training issues, and receive alerts. It automatically introspects your models, collects debugging data, and analyzes it to provide real-time alerts and advice on ways to optimize your training times and improve model quality. All information is visible as your models are training.

With Amazon SageMaker Model Monitor, you can detect quality deviations for deployed models, and receive alerts. You can easily visualize issues like data drift that could be affecting your models. No code needed: all it takes is a few clicks.

With Amazon SageMaker Autopilot, you can build models automatically with full control and visibility. Algorithm selection, data preprocessing, and model tuning are taken care of automatically, as is all of the infrastructure.

Thanks to these new capabilities, Amazon SageMaker now covers the complete ML workflow to build, train, and deploy machine learning models, quickly and at any scale. The services mentioned above, except for Amazon SageMaker Notebooks, are covered in individual blog posts (see below) showing you how to quickly get started, so keep your eyes peeled and read on!
Amazon SageMaker Debugger
Amazon SageMaker Model Monitor
Amazon SageMaker Autopilot
Amazon SageMaker Experiments

Now Available!
Amazon SageMaker Studio is available today in US East (Ohio). Give it a try, and please send us feedback either in the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. - Julien

Amazon SageMaker Debugger – Debug Your Machine Learning Models

Amazon Web Services Blog -

Today, we’re extremely happy to announce Amazon SageMaker Debugger, a new capability of Amazon SageMaker that automatically identifies complex issues developing in machine learning (ML) training jobs.

Building and training ML models is a mix of science and craft (some would even say witchcraft). From collecting and preparing data sets to experimenting with different algorithms to figuring out optimal training parameters (the dreaded hyperparameters), ML practitioners need to clear quite a few hurdles to deliver high-performance models. This is the very reason why we built Amazon SageMaker: a modular, fully managed service that simplifies and speeds up ML workflows. As I keep finding out, ML seems to be one of Mr. Murphy’s favorite hangouts, and everything that may possibly go wrong often does! In particular, many obscure issues can happen during the training process, preventing your model from correctly extracting and learning patterns present in your data set. I’m not talking about software bugs in ML libraries (although they do happen too): most failed training jobs are caused by an inappropriate initialization of parameters, a poor combination of hyperparameters, a design issue in your own code, etc. To make things worse, these issues are rarely visible immediately: they grow over time, slowly but surely ruining your training process, and yielding low accuracy models. Let’s face it, even if you’re a bona fide expert, it’s devilishly difficult and time-consuming to identify them and hunt them down, which is why we built Amazon SageMaker Debugger. Let me tell you more.

Introducing Amazon SageMaker Debugger
In your existing training code for TensorFlow, Keras, Apache MXNet, PyTorch, and XGBoost, you can use the new SageMaker Debugger SDK to save internal model state at periodic intervals; as you can guess, it will be stored in Amazon Simple Storage Service (S3). This state is composed of:

The parameters being learned by the model, e.g. weights and biases for neural networks,
The changes applied to these parameters by the optimizer, aka gradients,
The optimization parameters themselves,
Scalar values, e.g. accuracies and losses,
The output of each layer,
Etc.

Each specific set of values – say, the sequence of gradients flowing over time through a specific neural network layer – is saved independently, and referred to as a tensor. Tensors are organized in collections (weights, gradients, etc.), and you can decide which ones you want to save during training. Then, using the SageMaker SDK and its estimators, you configure your training job as usual, passing additional parameters defining the rules you want SageMaker Debugger to apply. A rule is a piece of Python code that analyzes tensors for the model in training, looking for specific unwanted conditions. Pre-defined rules are available for common problems such as exploding/vanishing tensors (parameters reaching NaN or zero values), exploding/vanishing gradients, loss not changing, and more. Of course, you can also write your own rules.

Once the SageMaker estimator is configured, you can launch the training job. Immediately, it fires up a debug job for each rule that you configured, and they start inspecting available tensors. If a debug job detects a problem, it stops and logs additional information. A CloudWatch Events event is also sent, should you want to trigger additional automated steps. So now you know that your deep learning job suffers from, say, vanishing gradients.
With a little brainstorming and experience, you’ll know where to look: maybe the neural network is too deep? Maybe your learning rate is too small? As the internal state has been saved to S3, you can now use the SageMaker Debugger SDK to explore the evolution of tensors over time, confirm your hypothesis and fix the root cause. Let’s see SageMaker Debugger in action with a quick demo. Debugging Machine Learning Models with Amazon SageMaker Debugger At the core of SageMaker Debugger is the ability to capture tensors during training. This requires a little bit of instrumentation in your training code, in order to select the tensor collections you want to save, the frequency at which you want to save them, and whether you want to save the values themselves or a reduction (mean, average, etc.). For this purpose, the SageMaker Debugger SDK provides simple APIs for each framework that it supports. Let me show you how this works with a simple TensorFlow script, trying to fit a 2-dimension linear regression model. Of course, you’ll find more examples in this Github repository. Let’s take a look at the initial code: import argparse import numpy as np import tensorflow as tf import random parser = argparse.ArgumentParser() parser.add_argument('--model_dir', type=str, help="S3 path for the model") parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001) parser.add_argument('--steps', type=int, help="Number of steps to run", default=100) parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0) args = parser.parse_args() with tf.name_scope('initialize'): # 2-dimensional input sample x = tf.placeholder(shape=(None, 2), dtype=tf.float32) # Initial weights: [10, 10] w = tf.Variable(initial_value=[[10.], [10.]], name='weight1') # True weights, i.e. the ones we're trying to learn w0 = [[1], [1.]] with tf.name_scope('multiply'): # Compute true label y = tf.matmul(x, w0) # Compute "predicted" label y_hat = tf.matmul(x, w) with tf.name_scope('loss'): # Compute loss loss = tf.reduce_mean((y_hat - y) ** 2, name="loss") optimizer = tf.train.AdamOptimizer(args.lr) optimizer_op = optimizer.minimize(loss) with tf.Session() as sess: sess.run(tf.global_variables_initializer()) for i in range(args.steps): x_ = np.random.random((10, 2)) * args.scale _loss, opt = sess.run([loss, optimizer_op], {x: x_}) print (f'Step={i}, Loss={_loss}') Let’s train this script using the TensorFlow Estimator. I’m using SageMaker local mode, which is a great way to quickly iterate on experimental code. bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000} estimator = TensorFlow( role=sagemaker.get_execution_role(), base_job_name='debugger-simple-demo', train_instance_count=1, train_instance_type='local', entry_point='script-v1.py', framework_version='1.13.1', py_version='py3', script_mode=True, hyperparameters=bad_hyperparameters) Looking at the training log, things did not go well. Step=0, Loss=7.883463958023267e+23 algo-1-hrvqg_1 | Step=1, Loss=9.502028841062608e+23 algo-1-hrvqg_1 | Step=2, Loss=nan algo-1-hrvqg_1 | Step=3, Loss=nan algo-1-hrvqg_1 | Step=4, Loss=nan algo-1-hrvqg_1 | Step=5, Loss=nan algo-1-hrvqg_1 | Step=6, Loss=nan algo-1-hrvqg_1 | Step=7, Loss=nan algo-1-hrvqg_1 | Step=8, Loss=nan algo-1-hrvqg_1 | Step=9, Loss=nan Loss does not decrease at all, and even goes to infinity… This looks like an exploding tensor problem, which is one of the built-in rules defined in SageMaker Debugger. Let’s get to work. 
Using the Amazon SageMaker Debugger SDK In order to capture tensors, I need to instrument the training script with: A SaveConfig object specifying the frequency at which tensors should be saved, A SessionHook object attached to the TensorFlow session, putting everything together and saving required tensors during training, An (optional) ReductionConfig object, listing tensor reductions that should be saved instead of full tensors, An (optional) optimizer wrapper to capture gradients. Here’s the updated code, with extra command line arguments for SageMaker Debugger parameters. import argparse import numpy as np import tensorflow as tf import random import smdebug.tensorflow as smd parser = argparse.ArgumentParser() parser.add_argument('--model_dir', type=str, help="S3 path for the model") parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001 ) parser.add_argument('--steps', type=int, help="Number of steps to run", default=100 ) parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0 ) parser.add_argument('--debug_path', type=str, default='/opt/ml/output/tensors') parser.add_argument('--debug_frequency', type=int, help="How often to save tensor data", default=10) feature_parser = parser.add_mutually_exclusive_group(required=False) feature_parser.add_argument('--reductions', dest='reductions', action='store_true', help="save reductions of tensors instead of saving full tensors") feature_parser.add_argument('--no_reductions', dest='reductions', action='store_false', help="save full tensors") args = parser.parse_args() args = parser.parse_args() reduc = smd.ReductionConfig(reductions=['mean'], abs_reductions=['max'], norms=['l1']) if args.reductions else None hook = smd.SessionHook(out_dir=args.debug_path, include_collections=['weights', 'gradients', 'losses'], save_config=smd.SaveConfig(save_interval=args.debug_frequency), reduction_config=reduc) with tf.name_scope('initialize'): # 2-dimensional input sample x = tf.placeholder(shape=(None, 2), dtype=tf.float32) # Initial weights: [10, 10] w = tf.Variable(initial_value=[[10.], [10.]], name='weight1') # True weights, i.e. the ones we're trying to learn w0 = [[1], [1.]] with tf.name_scope('multiply'): # Compute true label y = tf.matmul(x, w0) # Compute "predicted" label y_hat = tf.matmul(x, w) with tf.name_scope('loss'): # Compute loss loss = tf.reduce_mean((y_hat - y) ** 2, name="loss") hook.add_to_collection('losses', loss) optimizer = tf.train.AdamOptimizer(args.lr) optimizer = hook.wrap_optimizer(optimizer) optimizer_op = optimizer.minimize(loss) hook.set_mode(smd.modes.TRAIN) with tf.train.MonitoredSession(hooks=[hook]) as sess: for i in range(args.steps): x_ = np.random.random((10, 2)) * args.scale _loss, opt = sess.run([loss, optimizer_op], {x: x_}) print (f'Step={i}, Loss={_loss}') I also need to modify the TensorFlow Estimator, to use the SageMaker Debugger-enabled training container and to pass additional parameters. 
bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1} from sagemaker.debugger import Rule, rule_configs estimator = TensorFlow( role=sagemaker.get_execution_role(), base_job_name='debugger-simple-demo', train_instance_count=1, train_instance_type='ml.c5.2xlarge', image_name=cpu_docker_image_name, entry_point='script-v2.py', framework_version='1.15', py_version='py3', script_mode=True, hyperparameters=bad_hyperparameters, rules = [Rule.sagemaker(rule_configs.exploding_tensor())] ) estimator.fit() 2019-11-27 10:42:02 Starting - Starting the training job... 2019-11-27 10:42:25 Starting - Launching requested ML instances ********* Debugger Rule Status ********* * * ExplodingTensor: InProgress * **************************************** Two jobs are running: the actual training job, and a debug job checking for the rule defined in the Estimator. Quickly, the debug job fails! Describing the training job, I can get more information on what happened. description = client.describe_training_job(TrainingJobName=job_name) print(description['DebugRuleEvaluationStatuses'][0]['RuleConfigurationName']) print(description['DebugRuleEvaluationStatuses'][0]['RuleEvaluationStatus']) ExplodingTensor IssuesFound Let’s take a look at the saved tensors. Exploring Tensors I can easily grab the tensors saved in S3 during the training process. s3_output_path = description["DebugConfig"]["DebugHookConfig"]["S3OutputPath"] trial = create_trial(s3_output_path) Let’s list available tensors. trial.tensors() ['loss/loss:0', 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0', 'initialize/weight1:0'] All values are numpy arrays, and I can easily iterate over them. tensor = 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0' for s in list(trial.tensor(tensor).steps()): print("Value: ", trial.tensor(tensor).step(s).value) Value: [[1.1508383e+23] [1.0809098e+23]] Value: [[1.0278440e+23] [1.1347468e+23]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] Value: [[nan] [nan]] As tensor names include the TensorFlow scope defined in the training code, I can easily see that something is wrong with my matrix multiplication. # Compute true label y = tf.matmul(x, w0) # Compute "predicted" label y_hat = tf.matmul(x, w) Digging a little deeper, the x input is modified by a scaling parameter, which I set to 100000000000 in the Estimator. The learning rate doesn’t look sane either. Bingo! x_ = np.random.random((10, 2)) * args.scale bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1} As you probably knew all along, setting these hyperparameters to more reasonable values will fix the training issue. Now Available! We believe Amazon SageMaker Debugger will help you find and solve training issues quicker, so it’s now your turn to go bug hunting. Amazon SageMaker Debugger is available today in all commercial regions where Amazon SageMaker is available. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. - Julien    
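Building on the trial API shown above, here is a small sketch that scans every saved tensor for NaN or infinite values, which is handy if you want to confirm an exploding-tensor diagnosis yourself. The create_trial import path follows the smdebug library’s examples and the placeholder S3 path stands in for the one retrieved from the job description above; treat both as assumptions to adapt to your own job.

import numpy as np
from smdebug.trials import create_trial

# Placeholder; in the post this path comes from the training job description.
s3_output_path = "s3://my-bucket/debugger-simple-demo/tensors"
trial = create_trial(s3_output_path)

# Walk through every saved tensor and flag steps containing NaN or infinite values.
for tensor_name in trial.tensors():
    for step in trial.tensor(tensor_name).steps():
        value = trial.tensor(tensor_name).step(step).value
        if not np.all(np.isfinite(value)):
            print(f"{tensor_name} is not finite at step {step}")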

Amazon SageMaker Model Monitor – Fully Managed Automatic Monitoring For Your Machine Learning Models

Amazon Web Services Blog -

Today, we’re extremely happy to announce Amazon SageMaker Model Monitor, a new capability of Amazon SageMaker that automatically monitors machine learning (ML) models in production, and alerts you when data quality issues appear.

The first thing I learned when I started working with data is that there is no such thing as paying too much attention to data quality. Raise your hand if you’ve spent hours hunting down problems caused by unexpected NULL values or by exotic character encodings that somehow ended up in one of your databases. As models are literally built from large amounts of data, it’s easy to see why ML practitioners spend so much time caring for their data sets. In particular, they make sure that data samples in the training set (used to train the model) and in the validation set (used to measure its accuracy) have the same statistical properties.

There be monsters! Although you have full control over your experimental data sets, the same can’t be said for real-life data that your models will receive. Of course, that data will be unclean, but a more worrisome problem is “data drift”, i.e. a gradual shift in the very statistical nature of the data you receive. Minimum and maximum values, mean, average, variance, and more: all these are key attributes that shape assumptions and decisions made during the training of a model. Intuitively, you can surely feel that any significant change in these values would impact the accuracy of predictions: imagine a loan application predicting higher amounts because input features are drifting or even missing!

Detecting these conditions is pretty difficult: you would need to capture data received by your models, run all kinds of statistical analysis to compare that data to the training set, define rules to detect drift, send alerts if it happens… and do it all over again each time you update your models. Expert ML practitioners certainly know how to build these complex tools, but at the great expense of time and resources. Undifferentiated heavy lifting strikes again… To help all customers focus on creating value instead, we built Amazon SageMaker Model Monitor. Let me tell you more.

Introducing Amazon SageMaker Model Monitor
A typical monitoring session goes like this. You first start from a SageMaker endpoint to monitor, either an existing one, or a new one created specifically for monitoring purposes. You can use SageMaker Model Monitor on any endpoint, whether the model was trained with a built-in algorithm, a built-in framework, or your own container. Using the SageMaker SDK, you can capture a configurable fraction of the data sent to the endpoint (you can also capture predictions if you’d like), and store it in one of your Amazon Simple Storage Service (S3) buckets. Captured data is enriched with metadata (content type, timestamp, etc.), and you can secure and access it just like any S3 object.

Then, you create a baseline from the data set that was used to train the model deployed on the endpoint (of course, you can reuse an existing baseline, too). This will fire up an Amazon SageMaker Processing job where SageMaker Model Monitor will:

Infer a schema for the input data, i.e. type and completeness information for each feature. You should review it, and update it if needed.
For pre-built containers only, compute feature statistics using Deequ, an open source tool based on Apache Spark that is developed and used at Amazon (blog post and research paper).
These statistics include KLL sketches, an advanced technique to compute accurate quantiles on streams of data, that we recently contributed to Deequ. Using these artifacts, the next step is to launch a monitoring schedule, to let SageMaker Model Monitor inspect collected data and prediction quality. Whether you’re using a built-in or custom container, a number of built-in rules are applied, and reports are periodically pushed to S3. The reports contain statistics and schema information on the data received during the latest time frame, as well as any violation that was detected. Last but not least, SageMaker Model Monitor emits per-feature metrics to Amazon CloudWatch, which you can use to set up dashboards and alerts. The summary metrics from CloudWatch are also visible in Amazon SageMaker Studio, and of course all statistics, monitoring results and data collected can be viewed and further analyzed in a notebook. For more information and an example on how to use SageMaker Model Monitor using AWS CloudFormation, refer to the developer guide. Now, let’s do a demo, using a churn prediction model trained with the built-in XGBoost algorithm. Enabling Data Capture The first step is to create an endpoint configuration to enable data capture. Here, I decide to capture 100% of incoming data, as well as model output (i.e. predictions). I’m also passing the content types for CSV and JSON data. data_capture_configuration = { "EnableCapture": True, "InitialSamplingPercentage": 100, "DestinationS3Uri": s3_capture_upload_path, "CaptureOptions": [ { "CaptureMode": "Output" }, { "CaptureMode": "Input" } ], "CaptureContentTypeHeader": { "CsvContentTypes": ["text/csv"], "JsonContentTypes": ["application/json"] } Next, I create the endpoint using the usual CreateEndpoint API. create_endpoint_config_response = sm_client.create_endpoint_config( EndpointConfigName = endpoint_config_name, ProductionVariants=[{ 'InstanceType':'ml.m5.xlarge', 'InitialInstanceCount':1, 'InitialVariantWeight':1, 'ModelName':model_name, 'VariantName':'AllTrafficVariant' }], DataCaptureConfig = data_capture_configuration) On an existing endpoint, I would have used the UpdateEndpoint API to seamlessly update the endpoint configuration. After invoking the endpoint repeatedly, I can see some captured data in S3 (output was edited for clarity). $ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/datacapture/DEMO-xgb-churn-pred-model-monitor-2019-11-22-07-59-33/ AllTrafficVariant/2019/11/22/08/24-40-519-9a9273ca-09c2-45d3-96ab-fc7be2402d43.jsonl AllTrafficVariant/2019/11/22/08/25-42-243-3e1c653b-8809-4a6b-9d51-69ada40bc809.jsonl Here’s a line from one of these files. "endpointInput":{ "observedContentType":"text/csv", "mode":"INPUT", "data":"132,25,113.2,96,269.9,107,229.1,87,7.1,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1", "encoding":"CSV" }, "endpointOutput":{ "observedContentType":"text/csv; charset=utf-8", "mode":"OUTPUT", "data":"0.01076381653547287", "encoding":"CSV"} }, "eventMetadata":{ "eventId":"6ece5c74-7497-43f1-a263-4833557ffd63", "inferenceTime":"2019-11-22T08:24:40Z"}, "eventVersion":"0"} Pretty much what I expected. Now, let’s create a baseline for this model. Creating A Monitoring Baseline This is a very simple step: pass the location of the baseline data set, and the location where results should be stored. from processingjob_wrapper import ProcessingJob processing_job = ProcessingJob(sm_client, role). 
create(job_name, baseline_data_uri, baseline_results_uri) Once that job is complete, I can see two new objects in S3: one for statistics, and one for constraints. aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/baselining/results/ constraints.json statistics.json The constraints.json file tells me about the inferred schema for the training data set (don’t forget to check it’s accurate). Each feature is typed, and I also get information on whether a feature is always present or not (1.0 means 100% here). Here are the first few lines. { "version" : 0.0, "features" : [ { "name" : "Churn", "inferred_type" : "Integral", "completeness" : 1.0 }, { "name" : "Account Length", "inferred_type" : "Integral", "completeness" : 1.0 }, { "name" : "VMail Message", "inferred_type" : "Integral", "completeness" : 1.0 }, { "name" : "Day Mins", "inferred_type" : "Fractional", "completeness" : 1.0 }, { "name" : "Day Calls", "inferred_type" : "Integral", "completeness" : 1.0 At the end of that file, I can see configuration information for CloudWatch monitoring: turn it on or off, set the drift threshold, etc. "monitoring_config" : { "evaluate_constraints" : "Enabled", "emit_metrics" : "Enabled", "distribution_constraints" : { "enable_comparisons" : true, "min_domain_mass" : 1.0, "comparison_threshold" : 1.0 } } The statistics.json file shows different statistics for each feature (mean, average, quantiles, etc.), as well as unique values received by the endpoint. Here’s an example. "name" : "Day Mins", "inferred_type" : "Fractional", "numerical_statistics" : { "common" : { "num_present" : 2333, "num_missing" : 0 }, "mean" : 180.22648949849963, "sum" : 420468.3999999996, "std_dev" : 53.987178959901556, "min" : 0.0, "max" : 350.8, "distribution" : { "kll" : { "buckets" : [ { "lower_bound" : 0.0, "upper_bound" : 35.08, "count" : 14.0 }, { "lower_bound" : 35.08, "upper_bound" : 70.16, "count" : 48.0 }, { "lower_bound" : 70.16, "upper_bound" : 105.24000000000001, "count" : 130.0 }, { "lower_bound" : 105.24000000000001, "upper_bound" : 140.32, "count" : 318.0 }, { "lower_bound" : 140.32, "upper_bound" : 175.4, "count" : 565.0 }, { "lower_bound" : 175.4, "upper_bound" : 210.48000000000002, "count" : 587.0 }, { "lower_bound" : 210.48000000000002, "upper_bound" : 245.56, "count" : 423.0 }, { "lower_bound" : 245.56, "upper_bound" : 280.64, "count" : 180.0 }, { "lower_bound" : 280.64, "upper_bound" : 315.72, "count" : 58.0 }, { "lower_bound" : 315.72, "upper_bound" : 350.8, "count" : 10.0 } ], "sketch" : { "parameters" : { "c" : 0.64, "k" : 2048.0 }, "data" : [ [ 178.1, 160.3, 197.1, 105.2, 283.1, 113.6, 232.1, 212.7, 73.3, 176.9, 161.9, 128.6, 190.5, 223.2, 157.9, 173.1, 273.5, 275.8, 119.2, 174.6, 133.3, 145.0, 150.6, 220.2, 109.7, 155.4, 172.0, 235.6, 218.5, 92.7, 90.7, 162.3, 146.5, 210.1, 214.4, 194.4, 237.3, 255.9, 197.9, 200.2, 120, ... Now, let’s start monitoring our endpoint. Monitoring An Endpoint Again, one API call is all that it takes: I simply create a monitoring schedule for my endpoint, passing the constraints and statistics file for the baseline data set. Optionally, I could also pass preprocessing and postprocessing functions, should I want to tweak data and predictions. 
ms = MonitoringSchedule(sm_client, role)
schedule = ms.create(
    mon_schedule_name,
    endpoint_name,
    s3_report_path,
    # record_preprocessor_source_uri=s3_code_preprocessor_uri,
    # post_analytics_source_uri=s3_code_postprocessor_uri,
    baseline_statistics_uri=baseline_results_uri + '/statistics.json',
    baseline_constraints_uri=baseline_results_uri + '/constraints.json'
)

Then, I start sending bogus data to the endpoint, i.e. samples constructed from random values, and I wait for SageMaker Model Monitor to start generating reports. The suspense is killing me!

Inspecting Reports
Quickly, I see that reports are available in S3.

mon_executions = sm_client.list_monitoring_executions(MonitoringScheduleName=mon_schedule_name, MaxResults=3)
for execution_summary in mon_executions['MonitoringExecutionSummaries']:
    print("ProcessingJob: {}".format(execution_summary['ProcessingJobArn'].split('/')[1]))
    print('MonitoringExecutionStatus: {} \n'.format(execution_summary['MonitoringExecutionStatus']))

ProcessingJob: model-monitoring-201911221050-df2c7fc4
MonitoringExecutionStatus: Completed

ProcessingJob: model-monitoring-201911221040-3a738dd7
MonitoringExecutionStatus: Completed

ProcessingJob: model-monitoring-201911221030-83f15fb9
MonitoringExecutionStatus: Completed

Let’s find the reports for one of these monitoring jobs.

desc_analytics_job_result = sm_client.describe_processing_job(ProcessingJobName=job_name)
report_uri = desc_analytics_job_result['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']
print('Report Uri: {}'.format(report_uri))

Report Uri: s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209

Ok, so what do we have here?

aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209/
constraint_violations.json
constraints.json
statistics.json

As you would expect, constraints.json and statistics.json contain schema and statistics information on the data samples processed by the monitoring job. Let’s open the third one, constraint_violations.json, directly!

"violations" : [ {
    "feature_name" : "State_AL",
    "constraint_check_type" : "data_type_check",
    "description" : "Value: 0.8 does not meet the constraint requirement! "
  }, {
    "feature_name" : "Eve Mins",
    "constraint_check_type" : "baseline_drift_check",
    "description" : "Numerical distance: 0.2711598746081505 exceeds numerical threshold: 0"
  }, {
    "feature_name" : "CustServ Calls",
    "constraint_check_type" : "baseline_drift_check",
    "description" : "Numerical distance: 0.6470588235294117 exceeds numerical threshold: 0"
  }

Oops! It looks like I’ve been assigning floating point values to integer features: surely that’s not going to work too well! Some features are also exhibiting drift, which isn’t good either. Maybe something is wrong with my data ingestion process, or maybe the distribution of data has actually changed, and I need to retrain the model. As all this information is available as CloudWatch metrics, I could define thresholds, set alarms, and even trigger new training jobs automatically.

Now Available!
As you can see, Amazon SageMaker Model Monitor is easy to set up, and helps you quickly know about quality issues in your ML models. Now it’s your turn: you can start using Amazon SageMaker Model Monitor today in all commercial regions where Amazon SageMaker is available. This capability is also integrated in Amazon SageMaker Studio, our workbench for ML projects.
Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. - Julien
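The post mentions that drift information is also emitted as per-feature CloudWatch metrics, which you can alarm on. Here is a hedged boto3 sketch of one way to do that; the namespace, metric name, and dimensions below are assumptions about how Model Monitor publishes per-endpoint metrics, so verify them against the metrics your endpoint actually emits before wiring up an alarm.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Assumed namespace/metric/dimensions -- check the metrics emitted for your endpoint.
cloudwatch.put_metric_alarm(
    AlarmName="churn-model-feature-drift",
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_Day_Mins",
    Dimensions=[
        {"Name": "Endpoint", "Value": "DEMO-xgb-churn-pred-model-monitor"},        # placeholder
        {"Name": "MonitoringSchedule", "Value": "DEMO-monitoring-schedule"},       # placeholder
    ],
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.2,
    ComparisonOperator="GreaterThanThreshold",
    AlarmDescription="Alert when baseline drift for the Day Mins feature exceeds 0.2",
)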

Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation

Amazon Web Services Blog -

Today, we’re extremely happy to launch Amazon SageMaker Processing, a new capability of Amazon SageMaker that lets you easily run your preprocessing, postprocessing, and model evaluation workloads on fully managed infrastructure.

Training an accurate machine learning (ML) model requires many different steps, but none is potentially more important than preprocessing your data set, e.g.:

Converting the data set to the input format expected by the ML algorithm you’re using,
Transforming existing features to a more expressive representation, such as one-hot encoding categorical features,
Rescaling or normalizing numerical features,
Engineering high level features, e.g. replacing mailing addresses with GPS coordinates,
Cleaning and tokenizing text for natural language processing applications,
And more!

These tasks involve running bespoke scripts on your data set (beneath a moonless sky, I’m told) and saving the processed version for later use by your training jobs. As you can guess, running them manually, or having to build and scale automation tools, is not an exciting prospect for ML teams. The same could be said about postprocessing jobs (filtering, collating, etc.) and model evaluation jobs (scoring models against different test sets). Solving this problem is why we built Amazon SageMaker Processing. Let me tell you more.

Introducing Amazon SageMaker Processing
Amazon SageMaker Processing introduces a new Python SDK that lets data scientists and ML engineers easily run preprocessing, postprocessing, and model evaluation workloads on Amazon SageMaker. This SDK uses SageMaker’s built-in container for scikit-learn, possibly the most popular library for data set transformation. If you need something else, you also have the ability to use your own Docker images without having to conform to any Docker image specification: this gives you maximum flexibility in running any code you want, whether on SageMaker Processing, on AWS container services like Amazon ECS and Amazon Elastic Kubernetes Service, or even on premises. How about a quick demo with scikit-learn? Then, I’ll briefly discuss using your own container. Of course, you’ll find complete examples on GitHub.

Preprocessing Data With The Built-In Scikit-Learn Container
Here’s how to use the SageMaker Processing SDK to run your scikit-learn jobs. First, let’s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements.

from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_count=1,
                                     instance_type='ml.m5.xlarge')

Then, we can run our preprocessing script (more on this fellow in a minute). A few things happen when we call run(): the data set (dataset.csv) is automatically copied inside the container under the destination directory (/opt/ml/processing/input), and we could add additional inputs if needed. This is where the Python script (preprocessing.py) reads it. Optionally, we could pass command line arguments to the script. The script preprocesses the data set, splits it three ways, and saves the files inside the container under /opt/ml/processing/output/train, /opt/ml/processing/output/validation, and /opt/ml/processing/output/test. Once the job completes, all outputs are automatically copied to your default SageMaker bucket in S3.
from sagemaker.processing import ProcessingInput, ProcessingOutput sklearn_processor.run( code='preprocessing.py', # arguments = ['arg1', 'arg2'], inputs=[ProcessingInput( source='dataset.csv', destination='/opt/ml/processing/input')], outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'), ProcessingOutput(source='/opt/ml/processing/output/validation'), ProcessingOutput(source='/opt/ml/processing/output/test')] ) That’s it! Let’s put everything together by looking at the skeleton of the preprocessing script. import pandas as pd from sklearn.model_selection import train_test_split # Read data locally df = pd.read_csv('/opt/ml/processing/input/dataset.csv') # Preprocess the data set downsampled = apply_mad_data_science_skills(df) # Split data set into training, validation, and test train, test = train_test_split(downsampled, test_size=0.2) train, validation = train_test_split(train, test_size=0.2) # Create local output directories try: os.makedirs('/opt/ml/processing/output/train') os.makedirs('/opt/ml/processing/output/validation') os.makedirs('/opt/ml/processing/output/test') except: pass # Save data locally train.to_csv("/opt/ml/processing/output/train/train.csv") validation.to_csv("/opt/ml/processing/output/validation/validation.csv") test.to_csv("/opt/ml/processing/output/test/test.csv") print('Finished running processing job') A quick look to the S3 bucket confirms that files have been successfully processed and saved. Now I could use them directly as input for a SageMaker training job. $ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker-scikit-learn-2019-11-20-13-57-17-805/output 2019-11-20 15:03:22 19967 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/test.csv 2019-11-20 15:03:22 64998 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/train.csv 2019-11-20 15:03:22 18058 sagemaker-scikit-learn-2019-11-20-13-57-17-805/output/validation.csv Now what about using your own container? Processing Data With Your Own Container Let’s say you’d like to preprocess text data with the popular spaCy library. Here’s how you could define a vanilla Docker container for it. FROM python:3.7-slim-buster # Install spaCy, pandas, and an english language model for spaCy. RUN pip3 install spacy==2.2.2 && pip3 install pandas==0.25.3 RUN python3 -m spacy download en_core_web_md # Make sure python doesn't buffer stdout so we get logs ASAP. ENV PYTHONUNBUFFERED=TRUE ENTRYPOINT ["python3"] Then, you would build the Docker container, test it locally, and push it to Amazon Elastic Container Registry, our managed Docker registry service. The next step would be to configure a processing job using the ScriptProcessor object, passing the name of the container you built and pushed. from sagemaker.processing import ScriptProcessor script_processor = ScriptProcessor(image_uri='123456789012.dkr.ecr.us-west-2.amazonaws.com/sagemaker-spacy-container:latest', role=role, instance_count=1, instance_type='ml.m5.xlarge') Finally, you would run the job just like in the previous example. script_processor.run(code='spacy_script.py', inputs=[ProcessingInput( source='dataset.csv', destination='/opt/ml/processing/input_data')], outputs=[ProcessingOutput(source='/opt/ml/processing/processed_data')], arguments=['tokenizer', 'lemmatizer', 'pos-tagger'] ) The rest of the process is exactly the same as above: copy the input(s) inside the container, copy the output(s) from the container to S3. Pretty simple, don’t you think? 
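The custom-container example above shows the Dockerfile and the run() call, but not spacy_script.py itself. Here is a purely hypothetical sketch of what such a script could look like, assuming the input CSV has a 'text' column; the file paths match the destination and source values passed to ProcessingInput and ProcessingOutput above, and the command line arguments passed via run() are ignored for brevity.

import os

import pandas as pd
import spacy

# Paths match the container paths used in the run() call above.
INPUT_PATH = "/opt/ml/processing/input_data/dataset.csv"
OUTPUT_DIR = "/opt/ml/processing/processed_data"

# Load the English model installed in the Dockerfile above.
nlp = spacy.load("en_core_web_md")

# Assumption: the data set has a 'text' column to process.
df = pd.read_csv(INPUT_PATH)
docs = df["text"].apply(nlp)
df["tokens"] = docs.apply(lambda doc: [token.text for token in doc])
df["lemmas"] = docs.apply(lambda doc: [token.lemma_ for token in doc])

os.makedirs(OUTPUT_DIR, exist_ok=True)
df.to_csv(os.path.join(OUTPUT_DIR, "processed.csv"), index=False)
print("Finished running processing job")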
Again, I focused on preprocessing here, but you can run similar jobs for postprocessing and model evaluation (a hedged evaluation sketch follows below). Don’t forget to check out the examples on GitHub.

Now Available!
Amazon SageMaker Processing is available today in all commercial AWS Regions where Amazon SageMaker is available. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. — Julien
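As a quick illustration of the model-evaluation use case mentioned above, here is a hedged sketch that reuses the SKLearnProcessor pattern from earlier to score a trained model against a held-out test set. The S3 URIs and the evaluation.py script are hypothetical placeholders (not part of the original post), and role is the same execution role used above.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

evaluation_processor = SKLearnProcessor(framework_version='0.20.0',
                                        role=role,
                                        instance_count=1,
                                        instance_type='ml.m5.xlarge')

# evaluation.py (hypothetical) would load the model, score it on test.csv,
# and write metrics to /opt/ml/processing/evaluation.
evaluation_processor.run(
    code='evaluation.py',
    inputs=[ProcessingInput(source='s3://my-bucket/model/model.tar.gz',      # placeholder
                            destination='/opt/ml/processing/model'),
            ProcessingInput(source='s3://my-bucket/data/test.csv',           # placeholder
                            destination='/opt/ml/processing/test')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/evaluation')]
)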

Amazon SageMaker Autopilot – Automatically Create High-Quality Machine Learning Models With Full Control And Visibility

Amazon Web Services Blog -

Today, we’re extremely happy to launch Amazon SageMaker Autopilot to automatically create the best classification and regression machine learning models, while allowing full control and visibility.

In 1959, Arthur Samuel defined machine learning as the ability for computers to learn without being explicitly programmed. In practice, this means finding an algorithm that can extract patterns from an existing data set, and use these patterns to build a predictive model that will generalize well to new data. Since then, lots of machine learning algorithms have been invented, giving scientists and engineers plenty of options to choose from, and helping them build amazing applications. However, this abundance of algorithms also creates a difficulty: which one should you pick? How can you reliably figure out which one will perform best on your specific business problem? In addition, machine learning algorithms usually have a long list of training parameters (also called hyperparameters) that need to be set “just right” if you want to squeeze every bit of extra accuracy from your models. To make things worse, algorithms also require data to be prepared and transformed in specific ways (aka feature engineering) for optimal learning… and you need to pick the best instance type. If you think this sounds like a lot of experimental, trial and error work, you’re absolutely right. Machine learning is definitely a mix of hard science and cooking recipes, making it difficult for non-experts to get good results quickly. What if you could rely on a fully managed service to solve that problem for you? Call an API and get the job done? Enter Amazon SageMaker Autopilot.

Introducing Amazon SageMaker Autopilot
Using a single API call, or a few clicks in Amazon SageMaker Studio, SageMaker Autopilot first inspects your data set, and runs a number of candidates to figure out the optimal combination of data preprocessing steps, machine learning algorithms, and hyperparameters. Then, it uses this combination to train an Inference Pipeline, which you can easily deploy either on a real-time endpoint or for batch processing. As usual with Amazon SageMaker, all of this takes place on fully-managed infrastructure. Last but not least, SageMaker Autopilot also generates Python code showing you exactly how data was preprocessed: not only can you understand what SageMaker Autopilot did, you can also reuse that code for further manual tuning if you’re so inclined. As of today, SageMaker Autopilot supports:

Input data in tabular format, with automatic data cleaning and preprocessing,
Automatic algorithm selection for linear regression, binary classification, and multi-class classification,
Automatic hyperparameter optimization,
Distributed training,
Automatic instance and cluster size selection.

Let me show you how simple this is.

Using AutoML with Amazon SageMaker Autopilot
Let’s use this sample notebook as a starting point: it builds a binary classification model predicting if customers will accept or decline a marketing offer. Please take a few minutes to read it: as you will see, the business problem itself is easy to understand, and the data set is neither large nor complicated. Yet, several non-intuitive preprocessing steps are required, and there’s also the delicate matter of picking an algorithm and its parameters… SageMaker Autopilot to the rescue!

First, I grab a copy of the data set, and take a quick look at the first few lines. Then, I upload it to Amazon Simple Storage Service (S3) without any preprocessing whatsoever.
sess.upload_data(path="automl-train.csv", key_prefix=prefix + "/input") 's3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-automl-dm/input/automl-train.csv' Now, let’s configure the AutoML job: Set the location of the data set, Select the target attribute that I want the model to predict: in this case, it’s the ‘y’ column showing if a customer accepted the offer or not, Set the location of training artifacts. input_data_config = [{ 'DataSource': { 'S3DataSource': { 'S3DataType': 'S3Prefix', 'S3Uri': 's3://{}/{}/input'.format(bucket,prefix) } }, 'TargetAttributeName': 'y' } ] output_data_config = { 'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix) } That’s it! Of course, SageMaker Autopilot has a number of options that will come in handy as you learn more about your data and your models, e.g.: Set the type of problem you want to train on: linear regression, binary classification, or multi-class classification. If you’re not sure, SageMaker Autopilot will figure it out automatically by analyzing the values of the target attribute. Use a specific metric for model evaluation. Define completion criteria: maximum running time, etc. One thing I don’t have to do is size the training cluster, as SageMaker Autopilot uses a heuristic based on data size and algorithm. Pretty cool! With configuration out of the way, I can fire up the job with the CreateAutoMl API. auto_ml_job_name = 'automl-dm-' + timestamp_suffix print('AutoMLJobName: ' + auto_ml_job_name) sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name, InputDataConfig=input_data_config, OutputDataConfig=output_data_config, RoleArn=role) AutoMLJobName: automl-dm-28-10-17-49 A job runs in four steps (you can use the DescribeAutoMlJob API to view them). Splitting the data set into train and validation sets, Analyzing data, in order to recommend pipelines that should be tried out on the data set, Feature engineering, where transformations are applied to the data set and to individual features,  Pipeline selection and hyperparameter tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm. Once the maximum number of candidates – or one of the stopping conditions – has been reached, the job is complete. I can get detailed information on all candidates using the ListCandidatesForAutoMlJob API , and also view them in the AWS console. candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates'] index = 1 for candidate in candidates: print (str(index) + " " + candidate['CandidateName'] + " " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value'])) index += 1 1 automl-dm-28-tuning-job-1-fabb8-001-f3b6dead 0.9186699986457825 2 automl-dm-28-tuning-job-1-fabb8-004-03a1ff8a 0.918304979801178 3 automl-dm-28-tuning-job-1-fabb8-003-c443509a 0.9181839823722839 4 automl-dm-28-tuning-job-1-ed07c-006-96f31fde 0.9158779978752136 5 automl-dm-28-tuning-job-1-ed07c-004-da2d99af 0.9130859971046448 6 automl-dm-28-tuning-job-1-ed07c-005-1e90fd67 0.9130859971046448 7 automl-dm-28-tuning-job-1-ed07c-008-4350b4fa 0.9119930267333984 8 automl-dm-28-tuning-job-1-ed07c-007-dae75982 0.9119930267333984 9 automl-dm-28-tuning-job-1-ed07c-009-c512379e 0.9119930267333984 10 automl-dm-28-tuning-job-1-ed07c-010-d905669f 0.8873512744903564 For now, I’m only interested in the best trial: 91.87% validation accuracy. 
Let’s deploy it to a SageMaker endpoint, just like we would deploy any model: Create a model, Create an endpoint configuration, Create the endpoint. model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'], ModelName=model_name, ExecutionRoleArn=role) ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name, ProductionVariants=[{'InstanceType':'ml.m5.2xlarge', 'InitialInstanceCount':1, 'ModelName':model_name, 'VariantName':variant_name}]) create_endpoint_response = sm.create_endpoint(EndpointName=ep_name, EndpointConfigName=epc_name) After a few minutes, the endpoint is live, and I can use it for prediction. SageMaker business as usual! Now, I bet you’re curious about how the model was built, and what the other candidates are. Let me show you. Full Visibility And Control with Amazon SageMaker Autopilot SageMaker Autopilot stores training artifacts in S3, including two auto-generated notebooks! job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name) job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation'] job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation'] print(job_data_notebook) print(job_candidate_notebook) s3://<PREFIX_REMOVED>/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb s3://<PREFIX_REMOVED>/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb The first one contains information about the data set. The second one contains full details on the SageMaker Autopilot job: candidates, data preprocessing steps, etc. All code is available, as well as ‘knobs’ you can change for further experimentation. As you can see, you have full control and visibility on how models are built. Now Available! I’m very excited about Amazon SageMaker Autopilot, because it’s making machine learning simpler and more accessible than ever. Whether you’re just beginning with machine learning, or whether you’re a seasoned practitioner, SageMaker Autopilot will help you build better models quicker using either one of these paths: Easy no-code path in Amazon SageMaker Studio, Easy code path with the SageMaker Autopilot SDK, In-depth path with candidate generation notebook. Now it’s your turn. You can start using SageMaker Autopilot today in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Canada (Central), South America (São Paulo), Europe (Ireland), Europe (London), Europe (Paris), Europe (Frankfurt), Middle East (Bahrain), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo). Please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. — Julien
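As an addendum to the deployment step earlier in this post: once the endpoint is live, it can be invoked with a CSV payload. A minimal, hypothetical sketch follows; the feature values are made up and would need to match the order of the training columns (minus the ‘y’ target).

import boto3

smrt = boto3.client('sagemaker-runtime')

# Hypothetical, unlabeled row in the same CSV format as the training data
payload = '56,housemaid,married,basic.4y,no,no,yes,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191'

response = smrt.invoke_endpoint(EndpointName=ep_name,   # endpoint created above
                                ContentType='text/csv',
                                Body=payload)
print(response['Body'].read().decode())                 # predicted label, e.g. 'no'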

Amazon SageMaker Experiments – Organize, Track And Compare Your Machine Learning Trainings

Amazon Web Services Blog -

Today, we’re extremely happy to announce Amazon SageMaker Experiments, a new capability of Amazon SageMaker that lets you organize, track, compare and evaluate machine learning (ML) experiments and model versions. ML is a highly iterative process. During the course of a single project, data scientists and ML engineers routinely train thousands of different models in search of maximum accuracy. Indeed, the number of combinations for algorithms, data sets, and training parameters (aka hyperparameters) is infinite… and therein lies the proverbial challenge of finding a needle in a haystack. Tools like Automatic Model Tuning and Amazon SageMaker Autopilot help ML practitioners explore a large number of combinations automatically, and quickly zoom in on high-performance models. However, they further add to the explosive growth of training jobs. Over time, this creates a new difficulty for ML teams, as it becomes near-impossible to efficiently deal with hundreds of thousands of jobs: keeping track of metrics, grouping jobs by experiment, comparing jobs in the same experiment or across experiments, querying past jobs, etc. Of course, this can be solved by building, managing and scaling bespoke tools: however, doing so diverts valuable time and resources away from actual ML work. In the spirit of helping customers focus on ML and nothing else, we couldn’t leave this problem unsolved.

Introducing Amazon SageMaker Experiments

First, let’s define core concepts: A trial is a collection of training steps involved in a single training job. Training steps typically include preprocessing, training, model evaluation, etc. A trial is also enriched with metadata for inputs (e.g. algorithm, parameters, data sets) and outputs (e.g. models, checkpoints, metrics). An experiment is simply a collection of trials, i.e. a group of related training jobs. The goal of SageMaker Experiments is to make it as simple as possible to create experiments, populate them with trials, and run analytics across trials and experiments. For this purpose, we introduce a new Python SDK containing logging and analytics APIs. When running your training jobs on SageMaker or SageMaker Autopilot, all you have to do is pass an extra parameter to the Estimator, defining the name of the experiment that this trial should be attached to. All inputs and outputs will be logged automatically. Once you’ve run your training jobs, the SageMaker Experiments SDK lets you load experiment and trial data in the popular pandas dataframe format. Pandas truly is the Swiss army knife of ML practitioners, and you’ll be able to perform any analysis that you may need. Go one step further by building cool visualizations with matplotlib, and you’ll be well on your way to taming that wild horde of training jobs! As you would expect, SageMaker Experiments is nicely integrated in Amazon SageMaker Studio. You can run complex queries to quickly find the past trial you’re looking for. You can also visualize real-time model leaderboards and metric charts. How about a quick demo?

Logging Training Information With Amazon SageMaker Experiments

Let’s start from a PyTorch script classifying images from the MNIST data set, using a simple two-layer convolution neural network (CNN).
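The script itself isn’t reproduced here. As a rough illustration only (this is not the actual mnist.py), a two-layer CNN of this kind could look like the following in PyTorch, with hidden_channels being the hyperparameter varied below.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    # Two convolution layers followed by two fully connected layers, for 28x28 MNIST images.
    def __init__(self, hidden_channels, kernel_size=5, dropout=0.5):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, hidden_channels, kernel_size=kernel_size)
        self.conv2 = nn.Conv2d(hidden_channels, 20, kernel_size=kernel_size)
        self.conv2_drop = nn.Dropout2d(p=dropout)
        self.fc1 = nn.Linear(320, 50)   # 20 channels * 4 * 4 after two pooling steps
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)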
If I wanted to run a single job on SageMaker, I could use the PyTorch estimator like so:

estimator = PyTorch(
    entry_point='mnist.py',
    role=role,
    sagemaker_session=sess,
    framework_version='1.1.0',
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge')
estimator.fit(inputs={'training': inputs})

Instead, let’s say that I want to run multiple versions of the same script, changing only one of the hyperparameters (the number of convolution filters used by the two convolution layers, aka number of hidden channels) to measure its impact on model accuracy. Of course, we could run these jobs, grab the training logs, extract metrics with fancy text filtering, etc. Or we could use SageMaker Experiments! All I need to do is: set up an experiment, use a tracker to log experiment metadata, create a trial for each training job I want to run, and run each training job, passing parameters for the experiment name and the trial name. First things first, let’s take care of the experiment.

from smexperiments.experiment import Experiment
mnist_experiment = Experiment.create(
    experiment_name="mnist-hand-written-digits-classification",
    description="Classification of mnist hand-written digits",
    sagemaker_boto_client=sm)

Then, let’s add a few things that we want to keep track of, like the location of the data set and normalization values we applied to it.

from smexperiments.tracker import Tracker
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    tracker.log_input(name="mnist-dataset", media_type="s3/uri", value=inputs)
    tracker.log_parameters({
        "normalization_mean": 0.1307,
        "normalization_std": 0.3081,
    })

Now let’s run a few jobs. I simply loop over the different values that I want to try, creating a new trial for each training job and adding the tracker information to it.

from smexperiments.trial import Trial
for i, num_hidden_channel in enumerate([2, 5, 10, 20, 32]):
    trial_name = f"cnn-training-job-{num_hidden_channel}-hidden-channels-{int(time.time())}"
    cnn_trial = Trial.create(
        trial_name=trial_name,
        experiment_name=mnist_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )
    cnn_trial.add_trial_component(tracker.trial_component)

Then, I configure the estimator, passing the value for the hyperparameter I’m interested in, and leaving the other ones as is. I’m also passing regular expressions to extract metrics from the training log. All of these will be stored in the trial: in fact, all parameters (passed or default) will be.

    estimator = PyTorch(
        entry_point='mnist.py',
        role=role,
        sagemaker_session=sess,
        framework_version='1.1.0',
        train_instance_count=1,
        train_instance_type='ml.p3.2xlarge',
        hyperparameters={
            'hidden_channels': num_hidden_channel
        },
        metric_definitions=[
            {'Name':'train:loss', 'Regex':'Train Loss: (.*?);'},
            {'Name':'test:loss', 'Regex':'Test Average loss: (.*?),'},
            {'Name':'test:accuracy', 'Regex':'Test Accuracy: (.*?)%;'}
        ]
    )

Finally, I run the training job, associating it to the experiment and the trial.

    cnn_training_job_name = "cnn-training-job-{}".format(int(time.time()))
    estimator.fit(
        inputs={'training': inputs},
        job_name=cnn_training_job_name,
        experiment_config={
            "ExperimentName": mnist_experiment.experiment_name,
            "TrialName": cnn_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        }
    )
    # end of loop

Once all jobs are complete, I can run analytics. Let’s find out how we did.

Analytics with Amazon SageMaker Experiments

All information on an experiment can be easily exported to a Pandas DataFrame.
from sagemaker.analytics import ExperimentAnalytics
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=mnist_experiment.experiment_name
)
analytic_table = trial_component_analytics.dataframe()

If I want to drill down, I can specify additional parameters, e.g.:

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=mnist_experiment.experiment_name,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
    metric_names=['test:accuracy'],
    parameter_names=['hidden_channels', 'epochs', 'dropout', 'optimizer']
)
analytic_table = trial_component_analytics.dataframe()

This builds a DataFrame where trials are sorted by decreasing test accuracy, showing only some of the hyperparameters for each trial.

for col in analytic_table.columns:
    print(col)

TrialComponentName
DisplayName
SourceArn
dropout
epochs
hidden_channels
optimizer
test:accuracy - Min
test:accuracy - Max
test:accuracy - Avg
test:accuracy - StdDev
test:accuracy - Last
test:accuracy - Count

From here on, your imagination is the limit. Pandas is the Swiss army knife of data analysis, and you’ll be able to compare trials and experiments in every possible way. Last but not least, thanks to the integration with Amazon SageMaker Studio, you’ll be able to visualize all this information in real-time with predefined widgets. To learn more about Amazon SageMaker Studio, visit this blog post.

Now Available!

I just scratched the surface of what you can do with Amazon SageMaker Experiments, and I believe it will help you tame the wild horde of jobs that you have to deal with every day. The service is available today in all commercial AWS Regions where Amazon SageMaker is available. Give it a try and please send us feedback, either in the AWS forum for Amazon SageMaker, or through your usual AWS contacts. - Julien
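As an addendum to the matplotlib suggestion earlier in this post, here is a minimal, hypothetical sketch that plots the best test accuracy of each trial against its hidden_channels value, assuming the analytic_table dataframe and the column names printed above.

import matplotlib.pyplot as plt

# Sort trials by the hyperparameter value and plot the best test accuracy of each one
df = analytic_table.sort_values('hidden_channels')
plt.plot(df['hidden_channels'], df['test:accuracy - Max'], marker='o')
plt.xlabel('hidden channels')
plt.ylabel('best test accuracy (%)')
plt.title('MNIST CNN trials')
plt.show()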

Now Available on Amazon SageMaker: The Deep Graph Library

Amazon Web Services Blog -

Today, we’re happy to announce that the Deep Graph Library, an open source library built for easy implementation of graph neural networks, is now available on Amazon SageMaker. In recent years, deep learning has taken the world by storm thanks to its uncanny ability to extract elaborate patterns from complex data, such as free-form text, images, or videos. However, lots of datasets don’t fit these categories and are better expressed with graphs. Intuitively, we can feel that traditional neural network architectures like convolution neural networks or recurrent neural networks are not a good fit for such datasets, and a new approach is required.

A Primer On Graph Neural Networks

Graph neural networks (GNN) are one of the most exciting developments in machine learning today, and these reference papers will get you started. GNNs are used to train predictive models on datasets such as: social networks, where graphs show connections between related people; recommender systems, where graphs show interactions between customers and items; chemical analysis, where compounds are modeled as graphs of atoms and bonds; cybersecurity, where graphs describe connections between source and destination IP addresses; and more! Most of the time, these datasets are extremely large and only partially labeled. Consider a fraud detection scenario where we would try to predict the likelihood that an individual is a fraudulent actor by analyzing his connections to known fraudsters. This problem could be defined as a semi-supervised learning task, where only a fraction of graph nodes would be labeled (‘fraudster’ or ‘legitimate’). This should be a better solution than trying to build a large hand-labeled dataset, and “linearizing” it to apply traditional machine learning algorithms. Working on these problems requires domain knowledge (retail, finance, chemistry, etc.), computer science knowledge (Python, deep learning, open source tools), and infrastructure knowledge (training, deploying, and scaling models). Very few people master all these skills, which is why tools like the Deep Graph Library and Amazon SageMaker are needed.

Introducing The Deep Graph Library

First released on Github in December 2018, the Deep Graph Library (DGL) is a Python open source library that helps researchers and scientists quickly build, train, and evaluate GNNs on their datasets. DGL is built on top of popular deep learning frameworks like PyTorch and Apache MXNet. If you know either one of these, you’ll find yourself quite at home. No matter which framework you use, you can get started easily thanks to these beginner-friendly examples. I also found the slides and code for the GTC 2019 workshop very useful. Once you’re done with toy examples, you can start exploring the collection of cutting edge models already implemented in DGL. For example, you can train a document classification model using a Graph Convolution Network (GCN) and the CORA dataset by simply running:

$ python3 train.py --dataset cora --gpu 0 --self-loop

The code for all models is available for inspection and tweaking. These implementations have been carefully validated by AWS teams, who verified performance claims and made sure results could be reproduced. DGL also includes a collection of graph datasets that you can easily download and experiment with. Of course, you can install and run DGL locally, but to make your life simpler, we added it to the Deep Learning Containers for PyTorch and Apache MXNet.
This makes it easy to use DGL on Amazon SageMaker, in order to train and deploy models at any scale, without having to manage a single server. Let me show you how.

Using DGL On Amazon SageMaker

We added complete examples in the Github repository for SageMaker examples: one of them trains a simple GNN for molecular toxicity prediction using the Tox21 dataset. The problem we’re trying to solve is figuring out the potential toxicity of new chemical compounds with respect to 12 different targets (receptors inside biological cells, etc.). As you can imagine, this type of analysis is crucial when designing new drugs, and being able to quickly predict results without having to run in vitro experiments helps researchers focus their efforts on the most promising drug candidates. The dataset contains a little over 8,000 compounds: each one is modeled as a graph (atoms are vertices, atomic bonds are edges), and labeled 12 times (one label per target). Using a GNN, we’re going to build a multi-label binary classification model, allowing us to predict the potential toxicity of candidate molecules. In the training script, we can easily download the dataset from the DGL collection.

from dgl.data.chem import Tox21
dataset = Tox21()

Similarly, we can easily build a GNN classifier using the DGL model zoo.

from dgl import model_zoo
model = model_zoo.chem.GCNClassifier(
    in_feats=args['n_input'],
    gcn_hidden_feats=[args['n_hidden'] for _ in range(args['n_layers'])],
    n_tasks=dataset.n_tasks,
    classifier_hidden_feats=args['n_hidden']).to(args['device'])

The rest of the code is mostly vanilla PyTorch, and you should be able to find your bearings if you’re familiar with this library. When it comes to running this code on Amazon SageMaker, all we have to do is use a SageMaker Estimator, passing the full name of our DGL container, and the name of the training script as a hyperparameter.

estimator = sagemaker.estimator.Estimator(container,
    role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters={'entrypoint': 'main.py'},
    sagemaker_session=sess)
code_location = sess.upload_data(CODE_PATH, bucket=bucket, key_prefix=custom_code_upload_location)
estimator.fit({'training-code': code_location})

<output removed>
epoch 23/100, batch 48/49, loss 0.4684
epoch 23/100, batch 49/49, loss 0.5389
epoch 23/100, training roc-auc 0.9451
EarlyStopping counter: 10 out of 10
epoch 23/100, validation roc-auc 0.8375, best validation roc-auc 0.8495
Best validation score 0.8495
Test score 0.8273
2019-11-21 14:11:03 Uploading - Uploading generated training model
2019-11-21 14:11:03 Completed - Training job completed
Training seconds: 209
Billable seconds: 209

Now, we could grab the trained model in S3, and use it to predict toxicity for a large number of compounds, without having to run actual experiments. Fascinating stuff!

Now Available!

You can start using DGL on Amazon SageMaker today. Give it a try, and please send us feedback in the DGL forum, in the AWS forum for Amazon SageMaker, or through your usual AWS support contacts. – Julien
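As an addendum for readers who have never touched DGL: here is a tiny, self-contained sketch (unrelated to the Tox21 notebook above) of what message passing with DGL’s GraphConv layer looks like in PyTorch. The graph and feature sizes are arbitrary, and the API shown assumes a DGL 0.4-era release.

import dgl
import torch
import torch.nn as nn
from dgl.nn.pytorch import GraphConv

# A tiny 4-node graph (0->1, 1->2, 2->3), plus self-loops so each node keeps its own features
g = dgl.DGLGraph()
g.add_nodes(4)
g.add_edges([0, 1, 2], [1, 2, 3])
g.add_edges(g.nodes(), g.nodes())

features = torch.randn(4, 8)          # one 8-dimensional feature vector per node

class TinyGCN(nn.Module):
    def __init__(self, in_feats, hidden_feats, n_classes):
        super(TinyGCN, self).__init__()
        self.conv1 = GraphConv(in_feats, hidden_feats)
        self.conv2 = GraphConv(hidden_feats, n_classes)

    def forward(self, graph, x):
        h = torch.relu(self.conv1(graph, x))
        return self.conv2(graph, h)

model = TinyGCN(8, 16, 2)
logits = model(g, features)           # shape (4, 2): one score pair per node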

New – Amazon Managed Apache Cassandra Service (MCS)

Amazon Web Services Blog -

Managing databases at scale is never easy. One of the options to store, retrieve, and manage large amounts of structured data, including key-value and tabular formats, is Apache Cassandra. With Cassandra, you can use the expressive Cassandra Query Language (CQL) to build applications quickly. However, managing large Cassandra clusters can be difficult and takes a lot of time. You need specialized expertise to set up, configure, and maintain the underlying infrastructure, and have a deep understanding of the entire application stack, including the Apache Cassandra open source software. You need to add or remove nodes manually, rebalance partitions, and do so while keeping your application available with the required performance. Talking with customers, we found out that they often keep their clusters scaled up for peak load because scaling down is complex. To keep your Cassandra cluster updated, you have to do it node by node. It’s hard to back up and restore a cluster if something goes wrong during an update, and you may end up skipping patches or running an outdated version.

Introducing Amazon Managed Cassandra Service

Today, we are launching in open preview Amazon Managed Apache Cassandra Service (MCS), a scalable, highly available, and managed Apache Cassandra-compatible database service. Amazon MCS is serverless, so you pay for only the resources you use and the service automatically scales tables up and down in response to application traffic. You can build applications that serve thousands of requests per second with virtually unlimited throughput and storage. With Amazon MCS, you can run your Cassandra workloads on AWS using the same Cassandra application code and developer tools that you use today. Amazon MCS implements the Apache Cassandra version 3.11 CQL API, allowing you to use the code and drivers that you already have in your applications. Updating your application is as easy as changing the endpoint to the one in the Amazon MCS service table. Amazon MCS provides consistent single-digit-millisecond read and write performance at any scale, so you can build applications with low latency to provide a smooth user experience. You have visibility into how your application is performing using Amazon CloudWatch. There is no limit on the size of a table or the number of items, and you do not need to provision storage. Data storage is fully managed and highly available. Your table data is replicated automatically three times across multiple AWS Availability Zones for durability. All customer data is encrypted at rest by default. You can use encryption keys stored in AWS Key Management Service (KMS). Amazon MCS is also integrated with AWS Identity and Access Management (IAM) to help you manage access to your tables and data.

Using Amazon Managed Cassandra Service

You can use Amazon MCS with the console, CQL, or existing Apache 2.0 licensed Cassandra drivers. In the console there is a CQL editor, or you can connect using cqlsh. To connect using cqlsh, I need to generate service-specific credentials for an existing IAM user.
This is just a command using the AWS Command Line Interface (CLI):

aws iam create-service-specific-credential --user-name USERNAME --service-name cassandra.amazonaws.com
{
    "ServiceSpecificCredential": {
        "CreateDate": "2019-11-27T14:36:16Z",
        "ServiceName": "cassandra.amazonaws.com",
        "ServiceUserName": "USERNAME-at-123412341234",
        "ServicePassword": "...",
        "ServiceSpecificCredentialId": "...",
        "UserName": "USERNAME",
        "Status": "Active"
    }
}

Amazon MCS only accepts secure connections using TLS. I download the Amazon root certificate and edit the cqlshrc configuration file to use it. Now, I can connect with:

cqlsh {endpoint} {port} -u {ServiceUserName} -p {ServicePassword} --ssl

First, I create a keyspace. A keyspace contains one or more tables and defines the replication strategy for all the tables it contains. With Amazon MCS the default replication strategy for all keyspaces is the Single-region strategy. It replicates data 3 times across multiple Availability Zones in a single AWS Region. To create a keyspace I can use the console or CQL. In the Amazon MCS console, I provide the name for the keyspace. Similarly, I can use CQL to create the bookstore keyspace:

CREATE KEYSPACE IF NOT EXISTS bookstore WITH REPLICATION={'class': 'SingleRegionStrategy'};

Now I create a table. A table is where your data is organized and stored. Again, I can use the console or CQL. From the console, I select the bookstore keyspace and give the table a name. Below that, I add the columns for my books table. Each row in a table is referenced by a primary key, which can be composed of one or more columns, the values of which determine which partition the data is stored in. In my case the primary key is the ISBN. Optionally, I can add clustering columns, which determine the sort order of records within a partition. I am not using clustering columns for this table. Alternatively, using CQL, I can create the table with the following commands:

USE bookstore;
CREATE TABLE IF NOT EXISTS books (isbn text PRIMARY KEY, title text, author text, pages int, year_of_publication int);

I now use CQL to insert a record in the books table:

INSERT INTO books (isbn, title, author, pages, year_of_publication) VALUES ('978-0201896831', 'The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)', 'Donald E. Knuth', 672, 1997);

Let’s run a quick query. In the console, I select the books table and then Query table. In the CQL Editor, I use the default query and select Run command. By default, I see the result of the query in table view. If I prefer, I can see the result in JSON format, similar to what an application using the Cassandra API would see. To insert more records, I use cqlsh again and upload some data from a local CSV file:

COPY books (isbn, title, author, pages, year_of_publication) FROM './books.csv' WITH delimiter=',' AND header=TRUE;

Now I look again at the content of the books table:

SELECT * FROM books;

I can select a row using a primary key, or use filtering for additional conditions. For example:

SELECT title FROM books WHERE isbn='978-1942788713';
SELECT title FROM books WHERE author='Scott Page' ALLOW FILTERING;

With Amazon MCS you can use existing Apache 2.0–licensed Cassandra drivers and developer tools. Open-source Cassandra drivers are available for Java, Python, Ruby, .NET, Node.js, PHP, C++, Perl, and Go. You can learn more in the Amazon MCS documentation.

Available in Open Preview
Amazon MCS is available today in open preview in US East (N. Virginia), US East (Ohio), Europe (Stockholm), Asia Pacific (Singapore), and Asia Pacific (Tokyo). As we work with the Cassandra API libraries, we are contributing bug fixes to the open source Apache Cassandra project. We are also contributing back improvements such as built-in support for AWS authentication (SigV4), which simplifies managing credentials for customers running Cassandra on Amazon Elastic Compute Cloud (EC2), since EC2 and IAM can handle distribution and management of credentials using instance roles automatically. We are also announcing the funding of AWS promotional service credits for testing Cassandra-related open-source projects. To learn more about these contributions, visit the Open Source blog. During the preview, you can use Amazon MCS with on-demand capacity. At general availability, we will also offer the option to use provisioned throughput for more predictable workloads. With on-demand capacity mode, Amazon MCS charges you based on the amount of data your applications read and write from your tables. You do not need to specify how much read and write throughput capacity to provision to your tables because Amazon MCS accommodates your workloads instantly as they scale up or down. As part of the AWS Free Tier, you can get started with Amazon MCS for free. For the first three months, you are offered a monthly free tier of 30 million write request units, 30 million read request units, and 1 GB of storage. Your free tier starts when you create your first Amazon MCS resource. Next year we are making it easier to migrate your data to Amazon MCS, adding support for AWS Database Migration Service. Amazon MCS makes it easy to run Cassandra workloads at any scale, providing a simple programming interface to build new applications, or migrate existing ones. I can’t wait to see what you are going to use it for!
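As an illustration of the Python driver path mentioned above, here is a minimal, hedged sketch of connecting to Amazon MCS with the open-source cassandra-driver package. The endpoint, port, certificate path, and credentials are placeholders that would come from the console and the service-specific credentials generated earlier.

from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# TLS is required: load the Amazon root certificate downloaded earlier (path is a placeholder)
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('AmazonRootCA1.pem')
ssl_context.verify_mode = CERT_REQUIRED

# Service-specific credentials generated with 'aws iam create-service-specific-credential'
auth = PlainTextAuthProvider(username='USERNAME-at-123412341234', password='SERVICE_PASSWORD')

# Endpoint and port are placeholders; use the values shown in the Amazon MCS console
cluster = Cluster(['cassandra.us-east-1.amazonaws.com'], port=9142,
                  ssl_context=ssl_context, auth_provider=auth)
session = cluster.connect('bookstore')

row = session.execute("SELECT title FROM books WHERE isbn='978-0201896831'").one()
print(row.title)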

New for Amazon Redshift – Data Lake Export and Federated Query

Amazon Web Services Blog -

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze data using standard SQL and existing Business Intelligence (BI) tools. To get information from unstructured data that would not fit in a data warehouse, you can build a data lake. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. With a data lake built on Amazon Simple Storage Service (S3), you can easily run big data analytics and use machine learning to gain insights from your semi-structured (such as JSON, XML) and unstructured datasets. Today, we are launching two new features to help you improve the way you manage your data warehouse and integrate with a data lake: Data Lake Export to unload data from a Redshift cluster to S3 in Apache Parquet format, an efficient open columnar storage format optimized for analytics. Federated Query to be able, from a Redshift cluster, to query across data stored in the cluster, in your S3 data lake, and in one or more Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora PostgreSQL databases. This architectural diagram gives a quick summary of how these features work and how they can be used together with other AWS services. Let’s explain the interactions you see in the diagram better, starting from how you can use these features, and the advantages they provide. Using Redshift Data Lake Export You can now unload the result of a Redshift query to your S3 data lake in Apache Parquet format. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in S3, compared to text formats. This enables you to save data transformation and enrichment you have done in Redshift into your S3 data lake in an open format. You can then analyze the data in your data lake with Redshift Spectrum, a feature of Redshift that allows you to query data directly from files on S3. Or you can use different tools such as Amazon Athena, Amazon EMR, or Amazon SageMaker. To try this new feature, I create a new cluster from the Redshift console, and follow this tutorial to load sample data that keeps track of sales of musical events across different venues. I want to correlate this data with social media comments on the events stored in my data lake. To understand their relevance, each event should have a way of comparing its relative sales to other events. Let’s build a query in Redshift to export the data to S3. My data is stored across multiple tables. I need to create a query that gives me a single view of what is going on with sales. I want to join the content of the  sales and date tables, adding information on the gross sales for an event (total_price in the query), and the percentile in terms of all time gross sales compared to all events. 
To export the result of the query to S3 in Parquet format, I use the following SQL command:

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
         FROM sales, date,
              (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
               FROM (SELECT eventid, sum(pricepaid) total_price
                     FROM sales
                     GROUP BY eventid)) as percentile_events
         WHERE sales.dateid = date.dateid
           AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/Sales/'
FORMAT AS PARQUET
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

To give Redshift write access to my S3 bucket, I am using an AWS Identity and Access Management (IAM) role. I can see the result of the UNLOAD command using the AWS Command Line Interface (CLI). As expected, the output of the query is exported using the Parquet columnar data format:

$ aws s3 ls s3://MY-BUCKET/DataLake/Sales/
2019-11-25 14:26:56 1638550 0000_part_00.parquet
2019-11-25 14:26:56 1635489 0001_part_00.parquet
2019-11-25 14:26:56 1624418 0002_part_00.parquet
2019-11-25 14:26:56 1646179 0003_part_00.parquet

To optimize access to data, I can specify one or more partition columns so that unloaded data is automatically partitioned into folders in my S3 bucket. For example, I can unload sales data partitioned by year, month, and day. This enables my queries to take advantage of partition pruning and skip scanning irrelevant partitions, improving query performance and minimizing cost. To use partitioning, I need to add to the previous SQL command the PARTITION BY option, followed by the columns I want to use to partition the data in different directories. In my case, I want to partition the output based on the year and the calendar date (caldate in the query) of the sales.

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
         FROM sales, date,
              (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
               FROM (SELECT eventid, sum(pricepaid) total_price
                     FROM sales
                     GROUP BY eventid)) as percentile_events
         WHERE sales.dateid = date.dateid
           AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/SalesPartitioned/'
FORMAT AS PARQUET
PARTITION BY (year, caldate)
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

This time, the output of the query is stored in multiple partitions. For example, here’s the content of a folder for a specific year and date:

$ aws s3 ls s3://MY-BUCKET/DataLake/SalesPartitioned/year=2008/caldate=2008-07-20/
2019-11-25 14:36:17 11940 0000_part_00.parquet
2019-11-25 14:36:17 11052 0001_part_00.parquet
2019-11-25 14:36:17 11138 0002_part_00.parquet
2019-11-25 14:36:18 12582 0003_part_00.parquet

Optionally, I can use AWS Glue to set up a Crawler that (on demand or on a schedule) looks for data in my S3 bucket to update the Glue Data Catalog. When the Data Catalog is updated, I can easily query the data using Redshift Spectrum, Athena, or EMR. The sales data is now ready to be processed together with the unstructured and semi-structured (JSON, XML, Parquet) data in my data lake. For example, I can now use Apache Spark with EMR, or any SageMaker built-in algorithm, to access the data and get new insights.

Using Redshift Federated Query

You can now also access data in RDS and Aurora PostgreSQL stores directly from your Redshift data warehouse. In this way, you can access data as soon as it is available.
Straight from Redshift, you can now perform queries processing data in your data warehouse, transactional databases, and data lake, without requiring ETL jobs to transfer data to the data warehouse. Redshift leverages its advanced optimization capabilities to push down and distribute a significant portion of the computation directly into the transactional databases, minimizing the amount of data moving over the network. Using this syntax, you can add an external schema from an RDS or Aurora PostgreSQL database to a Redshift cluster:

CREATE EXTERNAL SCHEMA IF NOT EXISTS online_system
FROM POSTGRES
DATABASE 'online_sales_db' SCHEMA 'online_system'
URI 'my-hostname' PORT 5432
IAM_ROLE 'iam-role-arn'
SECRET_ARN 'ssm-secret-arn';

Schema and port are optional here. Schema will default to public if left unspecified, and the default port for PostgreSQL databases is 5432. Redshift uses AWS Secrets Manager to manage the credentials to connect to the external databases. With this command, all tables in the external schema are available and can be used by Redshift for any complex SQL query processing data in the cluster or, using Redshift Spectrum, in your S3 data lake. Coming back to the sales data example I used before, I can now correlate the trends of my historical data of musical events with real-time sales. In this way, I can understand if an event is performing as expected or not, and calibrate my marketing activities without delays. For example, after I define the online commerce database as the online_system external schema in my Redshift cluster, I can compare previous sales with what is in the online commerce system with this simple query:

SELECT eventid, sum(pricepaid) total_price, sum(online_pricepaid) online_total_price
FROM sales, online_system.current_sales
WHERE eventid = online_eventid
GROUP BY eventid;

Redshift doesn’t import the database or schema catalog in its entirety. When a query is run, it localizes the metadata for the Aurora and RDS tables (and views) that are part of the query. This localized metadata is then used for query compilation and plan generation.

Available Now

Amazon Redshift data lake export is a new tool to improve your data processing pipeline and is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Redshift availability, and check the version of your clusters. The new federation capability in Amazon Redshift is released as a public preview and allows you to bring together data stored in Redshift, S3, and one or more RDS and Aurora PostgreSQL databases. When creating a cluster in the Amazon Redshift management console, you can pick three tracks for maintenance: Current, Trailing, or Preview. Within the Preview track, preview_features should be chosen to participate in the Federated Query public preview. These features simplify data processing and analytics, giving you more tools to react quickly, and a single point of view for your data. Let me know what you are going to use them for! — Danilo
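As an addendum to the Data Lake Export section above: since the exported files are plain Parquet objects in S3, they can also be read outside Redshift. Here is a small, hypothetical sketch using pandas, assuming the pyarrow and s3fs packages are installed and reusing the example bucket path from the UNLOAD above (column names follow that query).

import pandas as pd

# Read one of the Parquet files written by the first UNLOAD command above
df = pd.read_parquet('s3://MY-BUCKET/DataLake/Sales/0000_part_00.parquet')

print(df.shape)
print(df[['eventid', 'total_price', 'percentile']].head())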

Announcing UltraWarm (Preview) for Amazon Elasticsearch Service

Amazon Web Services Blog -

Today, we are excited to announce UltraWarm, a fully managed, low-cost, warm storage tier for Amazon Elasticsearch Service. UltraWarm is now available in preview and takes a new approach to providing hot-warm tiering in Amazon Elasticsearch Service, offering up to 900 TB of storage at almost a 90% cost reduction over existing options. UltraWarm is a seamless extension to the Amazon Elasticsearch Service experience, enabling you to query and visualize across both hot and UltraWarm data, all from your familiar Kibana interface. UltraWarm data can be queried using the same APIs and tools you use today, and also supports popular Amazon Elasticsearch Service features like encryption at rest and in flight, integrated alerting, SQL querying, and more. A popular use case for our customers of Amazon Elasticsearch Service is to ingest and analyze high (and ever-growing) volumes of machine-generated log data. However, those customers tell us that they want to perform real-time analysis on more of this data, so they can use it to help quickly resolve operational and security issues. Storage and analysis of months, or even years, of data has been cost prohibitive for them at scale, causing some to turn to multiple analytics tools, while others simply delete valuable data, missing out on insights. UltraWarm, with its cost-effective storage backed by Amazon Simple Storage Service (S3), helps solve this problem, enabling customers to retain years of data for analysis. With the launch of UltraWarm, Amazon Elasticsearch Service supports two storage tiers, hot and UltraWarm. The hot tier is used for indexing, updating, and providing the fastest access to data. UltraWarm complements the hot tier to add support for high volumes of older, less frequently accessed data, enabling you to take advantage of a lower storage cost. As I mentioned earlier, UltraWarm stores data in S3 and uses custom, highly-optimized nodes, built on the AWS Nitro System, to cache, pre-fetch, and query that data. This all contributes to providing an interactive experience when querying and visualizing data. The UltraWarm preview is now available to all customers in the US East (N. Virginia, Ohio) and US West (Oregon) Regions. The UltraWarm tier is available with a pay-as-you-go pricing model, charging for the instance hours for your node, and utilized storage. The UltraWarm preview can be enabled on new Amazon Elasticsearch Service version 6.8 domains. To learn more, visit the technical documentation. — Steve

Amazon Redshift Update – Next-Generation Compute Instances and Managed, Analytics-Optimized Storage

Amazon Web Services Blog -

We launched Amazon Redshift back in 2012 (Amazon Redshift – The New AWS Data Warehouse). With tens of thousands of customers, it is now the world’s most popular data warehouse. Our customers enjoy consistently fast performance, support for complex queries, and transactional capabilities, all with industry-leading price-performance. The original Redshift model establishes a fairly rigid coupling between compute power and storage capacity. You create a cluster with a specific number of instances, and are committed to (and occasionally limited by) the amount of local storage that is provided with each instance. You can access additional compute power with on-demand Concurrency Scaling, and you can use Elastic Resize to scale your clusters up and down in minutes, giving you the ability to adapt to changing compute and storage needs. We think we can do even better! Today we are launching the next generation of Nitro-powered compute instances for Redshift, backed by a new managed storage model that gives you the power to separately optimize your compute power and your storage. This launch takes advantage of some architectural improvements including high-bandwidth networking, managed storage that uses local SSD-based storage backed by Amazon Simple Storage Service (S3), and multiple, advanced data management techniques to optimize data motion to and from S3. Together, these capabilities allow Redshift to deliver 3x the performance of any other cloud data warehouse service, and most existing Amazon Redshift customers using Dense Storage (DS2) instances will get up to 2x better performance and 2x more storage at the same cost. Among many other use cases, this new combo is a great fit for operational analytics, where much of the workload is focused on a small (and often recent) subset of the data in the data warehouse. In the past, customers would unload older data to other types of storage in order to stay within storage limits, leading to additional complexity and making queries on historical data very complex.

Next-Generation Compute Instances

The new RA3 instances are designed to work hand-in-glove with the new managed storage model. The ra3.16xlarge instances have 48 vCPUs, 384 GiB of memory, and up to 64 TB of storage. I can create clusters with 2 to 128 instances, giving me over 8 PB of compressed storage. I can also create a new RA3-powered cluster from a snapshot of an existing cluster, or I can use Classic resize to upgrade my cluster to use the new instance type. If you have an existing snapshot or a cluster, you can use the Amazon Redshift console to get a recommended RA3 configuration when you restore or resize. You can also get recommendations from the DescribeNodeConfigurationOptions function or the describe-node-configuration-options command.

Managed, Analytics-Optimized Storage

The new managed storage is equally exciting. There’s a cache of large-capacity, high-performance SSD-based storage on each instance, backed by S3, for scale, performance, and durability. The storage system uses multiple cues, including data block temperature, data block age, and workload patterns, to manage the cache for high performance. Data is automatically placed into the appropriate tier, and you need not do anything special to benefit from the caching or the other optimizations. You pay the same low price for SSD and S3 storage, and you can scale the storage capacity of your data warehouse without adding and paying for additional instances.
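As a hedged sketch of the DescribeNodeConfigurationOptions call mentioned above, here is how one might ask for RA3 recommendations with boto3; the snapshot identifier is a placeholder, and the exact response fields may differ from what is printed here.

import boto3

redshift = boto3.client('redshift')

# Ask for recommended node configurations when restoring an existing snapshot
options = redshift.describe_node_configuration_options(
    ActionType='restore-cluster',
    SnapshotIdentifier='my-ds2-cluster-snapshot')

for option in options['NodeConfigurationOptionList']:
    print(option['NodeType'], option['NumberOfNodes'],
          option.get('EstimatedDiskUtilizationPercent'))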
Price & Availability You can start using RA3 instances together with managed storage in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), EU (Frankfurt), EU (Ireland), EU (London). — Jeff;  

Easily Manage Shared Data Sets with Amazon S3 Access Points

Amazon Web Services Blog -

Storage that is secure, scalable, durable, and highly available is a fundamental component of cloud computing. That’s why Amazon Simple Storage Service (S3) was the first service launched by AWS, back in 2006. It has been a building block of many of the more than 175 services that AWS now offers. As we approach the beginning of a new decade, capabilities like Amazon Redshift, Amazon Athena, Amazon EMR and AWS Lake Formation have made S3 not just a way to store objects but an engine for turning that data into insights. These capabilities mean that access patterns and requirements for the data stored in buckets have evolved. Today we’re launching a new way to manage data access at scale for shared data sets in S3: Amazon S3 Access Points. S3 Access Points are unique hostnames with dedicated access policies that describe how data can be accessed using that endpoint. Before S3 Access Points, shared access to data meant managing a single policy document on a bucket. These policies could represent hundreds of applications with many differing permissions, making audits and updates a potential bottleneck affecting many systems. With S3 Access Points, you can add access points as you add additional applications or teams, keeping your policies specific and easier to manage. A bucket can have multiple access points, and each access point has its own AWS Identity and Access Management (IAM) policy. Access point policies are similar to bucket policies, but associated with the access point. S3 Access Points can also be restricted to only allow access from within an Amazon Virtual Private Cloud. And because each access point has a unique DNS name, you can now address your buckets with any name that is unique within your AWS account and region.

Creating S3 Access Points

Let’s add an access point to a bucket using the S3 Console. You can also create and manage your S3 Access Points using the AWS Command Line Interface (CLI), AWS SDKs, or via the API. I’ve selected a bucket that contains artifacts generated by an AWS Lambda function, and clicked on the access points tab. Let’s create a new access point. I want to give an IAM user Alice permission to GET and PUT objects with the prefix Alice. I’m going to name this access point alices-access-point. There are options for restricting access to a Virtual Private Cloud, which just requires a Virtual Private Cloud ID. In this case, I want to allow access from outside the VPC as well, so I selected Internet and moved on to the next step. S3 Access Points makes it easy to block public access. I’m going to block all public access to this access point. And now I can attach my policy. In this policy, our Principal is our user Alice, and the resource is our access point combined with every object with the prefix /Alice. For more examples of the kinds of policies you might want to attach to your S3 Access Points, take a look at the docs. After I create the access point, I can access it by hostname using the format https://[access_point_name]-[accountID].s3-accesspoint.[region].amazonaws.com. Via the SDKs and CLI, I can use it the same way I would use a bucket once I’ve updated to the latest version. For example, assuming I were authenticated as Alice, I could do the following:

$ aws s3api get-object --key /Alice/object.zip --bucket arn:aws:s3:us-east-1:[my-account-id]:alices-access-point download.zip

Access points that are not restricted to VPCs can also be used via the S3 Console.
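Beyond the CLI call above, the same access point can be addressed from the SDKs. Here is a minimal sketch with boto3, assuming a recent SDK version that accepts access point ARNs wherever a bucket name is expected; the account ID is a placeholder.

import boto3

s3 = boto3.client('s3')

# The access point ARN can be passed in place of a bucket name (recent SDKs only)
access_point_arn = 'arn:aws:s3:us-east-1:123456789012:accesspoint/alices-access-point'

s3.put_object(Bucket=access_point_arn, Key='Alice/hello.txt', Body=b'hello from Alice')
obj = s3.get_object(Bucket=access_point_arn, Key='Alice/hello.txt')
print(obj['Body'].read())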
Things to Know

S3 Access Points is available now in all AWS Regions, at no cost. By default each account can create 1,000 access points per region. You can use S3 Access Points with AWS CloudFormation. If you use AWS Organizations, you can add a Service Control Policy (SCP) requiring that all access points be restricted to a VPC. When it comes to software design, keeping scopes small and focused on a specific task is almost always a good decision. With S3 Access Points, you can customize hostnames and permissions for any user or application that needs access to your shared data set. Let us know how you like this new capability, and happy building! — Brandon

Amazon EKS on AWS Fargate Now Generally Available

Amazon Web Services Blog -

Starting today, you can use Amazon Elastic Kubernetes Service to run Kubernetes pods on AWS Fargate. EKS and Fargate make it straightforward to run Kubernetes-based applications on AWS by removing the need to provision and manage infrastructure for pods. With AWS Fargate, customers don’t need to be experts in Kubernetes operations to run a cost-optimized and highly-available cluster. Fargate eliminates the need for customers to create or manage EC2 instances for their Amazon EKS clusters. Customers no longer have to worry about patching, scaling, or securing a cluster of EC2 instances to run Kubernetes applications in the cloud. Using Fargate, customers define and pay for resources at the pod-level. This makes it easy to right-size resource utilization for each application and allows customers to clearly see the cost of each pod. I’m now going to use the rest of this blog to explore this new feature further and deploy a simple Kubernetes-based application using Amazon EKS on Fargate.

Let’s Build a Cluster

The simplest way to get a cluster set up is to use eksctl, the official CLI tool for EKS. The command below creates a cluster called demo-newsblog with no worker nodes.

eksctl create cluster --name demo-newsblog --region eu-west-1 --fargate

This single command did quite a lot under the hood. Not only did it create a cluster for me, amongst other things, it also created a Fargate profile. A Fargate profile lets me specify which Kubernetes pods I want to run on Fargate, which subnets my pods run in, and provides the IAM execution role used by the Kubernetes agent to download container images to the pod and perform other actions on my behalf. Understanding Fargate profiles is key to understanding how this feature works. So I am going to delete the Fargate profile that was automatically created for me and recreate it manually. To create a Fargate profile, I head over to the Amazon Elastic Kubernetes Service console and choose the cluster demo-newsblog. On the details page, under Fargate profiles, I choose Add Fargate profile. I then need to configure my new Fargate profile. For the name, I enter demo-default. In the Pod execution role, only IAM roles with the eks-fargate-pods.amazonaws.com service principal are shown. The eksctl tool creates an IAM role called AmazonEKSFargatePodExecutionRole; the documentation shows how this role can be created from scratch. In the Subnets section, by default, all subnets in my cluster’s VPC are selected. However, only private subnets are supported for Fargate pods, so I deselect the two public subnets. When I click next, I am taken to the Pod selectors screen. Here it asks me to enter a namespace. I add default, meaning that I want any pods that are created in the default Kubernetes namespace to run on Fargate. It’s important to understand that I don’t have to modify my Kubernetes app to get the pods running on Fargate, I just need a Fargate profile – if a pod in my Kubernetes app matches the namespace defined in my profile, that pod will run on Fargate. There is also a Match labels feature here, which I am not using. This allows you to specify the labels of the pods that you want to select, so you can get even more specific with which pods run on this profile. Finally, I click Next and then Create. It takes a minute for the profile to create and become active. In this demo, I also want everything to run on Fargate, including the CoreDNS pods that are part of Kubernetes.
To get them running on Fargate, I will add a second Fargate profile for everything in the kube-system namespace. This time, to add a bit of variety to the demo, I will use the command line to create my profile. Technically, I do not need to create a second profile for this. I could have added an additional namespace to the first profile, but this way, I get to explore an alternative way of creating a profile. First, I create the file below and save it as demo-kube-system-profile.json.

{
    "fargateProfileName": "demo-kube-system",
    "clusterName": "demo-newsblog",
    "podExecutionRoleArn": "arn:aws:iam::xxx:role/AmazonEKSFargatePodExecutionRole",
    "subnets": [
        "subnet-0968a124a4e4b0afe",
        "subnet-0723bbe802a360eb9"
    ],
    "selectors": [
        { "namespace": "kube-system" }
    ]
}

I then navigate to the folder that contains the file above and run the create-fargate-profile command in my terminal.

aws eks create-fargate-profile --cli-input-json file://demo-kube-system-profile.json

I am now ready to deploy a container to my cluster. To keep things simple, I deploy a single instance of nginx using the following kubectl command.

kubectl create deployment demo-app --image=nginx

I then check to see the state of my pods by running the get pods command.

kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
demo-app-6dbfc49497-67dxk   0/1     Pending   0          13s

If I run get nodes, I have three nodes (two for CoreDNS and one for nginx). These nodes represent the compute resources that have been instantiated for me to run my pods.

kubectl get nodes
NAME                                                    STATUS   ROLES    AGE     VERSION
fargate-ip-192-168-218-51.eu-west-1.compute.internal    Ready    <none>   4m45s   v1.14.8-eks
fargate-ip-192-168-221-91.eu-west-1.compute.internal    Ready    <none>   2m20s   v1.14.8-eks
fargate-ip-192-168-243-74.eu-west-1.compute.internal    Ready    <none>   4m40s   v1.14.8-eks

After a short time, I rerun the get pods command, and my demo-app now has a status of Running, meaning my container has been successfully deployed onto Fargate.

kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
demo-app-6dbfc49497-67dxk   1/1     Running   0          3m52s

Pricing and Limitations

With AWS Fargate, you pay only for the amount of vCPU and memory resources that your pod needs to run. This includes the resources the pod requests in addition to a small amount of memory needed to run Kubernetes components alongside the pod. Pods running on Fargate follow the existing pricing model. vCPU and memory resources are calculated from the time your pod’s container images are pulled until the pod terminates, rounded up to the nearest second. A minimum charge for 1 minute applies. Additionally, you pay the standard cost for each EKS cluster you run, $0.20 per hour. There are currently a few limitations that you should be aware of: there is a maximum of 4 vCPU and 30 GB of memory per pod; there is currently no support for stateful workloads that require persistent volumes or file systems; you cannot run DaemonSets, privileged pods, or pods that use HostNetwork or HostPort; and the only load balancer you can use is an Application Load Balancer.

Get Started Today

If you want to explore Amazon EKS on AWS Fargate yourself, you can try it now by heading on over to the EKS console in the following regions: US East (N. Virginia), US East (Ohio), Europe (Ireland), and Asia Pacific (Tokyo). — Martin
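As an addendum: the same Fargate profile could also be created from Python rather than the CLI. Here is a minimal sketch with boto3, reusing the placeholder role ARN and subnet IDs from the JSON file above.

import boto3

eks = boto3.client('eks')

# Equivalent to the 'aws eks create-fargate-profile' call above
eks.create_fargate_profile(
    fargateProfileName='demo-kube-system',
    clusterName='demo-newsblog',
    podExecutionRoleArn='arn:aws:iam::xxx:role/AmazonEKSFargatePodExecutionRole',
    subnets=['subnet-0968a124a4e4b0afe', 'subnet-0723bbe802a360eb9'],
    selectors=[{'namespace': 'kube-system'}])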

Best WordPress Plugins for Amazon Affiliates

HostGator Blog -

If you’re looking to monetize your WordPress site, there are a lot of approaches you can take. However, one of the most effective routes, especially for beginners, is affiliate marketing. With affiliate marketing, you recommend products to your readers, and if they purchase a product through your link, you’ll receive a commission. Succeeding with affiliate marketing is a lot more complex than that, but the basic idea remains the same.

One of the most popular ways to earn money as an affiliate is via the Amazon affiliate program. Luckily, WordPress and the Amazon affiliate program work really well together. There are tons of plugins that’ll help you maximize your earnings and optimize your affiliate pages. Below we dive into how the Amazon affiliate program works and highlight the best WordPress affiliate marketing plugins that’ll help you generate more revenue and make your life as an affiliate much easier.

The Amazon Affiliate Marketing Program

If you’re running a WordPress site that’s getting decent traffic, then you’ll have a multitude of ways to start generating an income. You could start selling your own products via an eCommerce store, sell your services as a freelancer, utilize display ads via a network like AdSense, or add affiliate links to your site and start making income via affiliate marketing.

Affiliate marketing is one of the easier ways to start generating revenue from your website. Essentially, you sign up for an affiliate program. In this case, you’d be signing up for Amazon’s affiliate program, called Amazon Associates. Once you’re approved, you’ll be able to start promoting products via Amazon and generating what are known as affiliate sales.

Whenever you mention a product on your site, you’ll include a link to the Amazon listing of the product you’re highlighting or reviewing. These links contain your unique tracking code, so whenever someone clicks the link from your site and purchases a product from Amazon, you’ll get a percentage of the sale. Entire websites are built around reviewing Amazon products, or you can add a link whenever it makes sense within your content. The links that you generate are completely unique to you, so Amazon will be able to determine that it was you who sent that visitor to their site.

Benefits of Adding Affiliate Links to Your WordPress Site

There are tons of different affiliate programs you can sign up for. For example, if you want to become a hosting affiliate, you can sign up for the affiliate program right here at HostGator. No matter what niche your website is in, you can find an affiliate program that matches up. However, Amazon is one of the most widely used affiliate programs in the world for a good reason. Here are the most significant benefits you’ll receive when you start adding Amazon affiliate links to your WordPress site:

1. They’re a Trusted Marketplace

Amazon is one of the most trusted retailers on the planet. Making good money as an affiliate is all about trust, and Amazon already has that. To make money as an affiliate, people are going to need to buy the products you’re recommending. Just clicking the link and browsing through the site isn’t enough. They need to either buy the product you’re suggesting or, in the case of Amazon, buy something from the store while the cookie window is still active. One of the biggest reasons people don’t follow through with their purchases is that they don’t trust the site.
The typical scenario goes like this: a visitor comes to your website and likes the product you’re recommending, so they click your link and head over to the site. However, when they get there, they don’t fully trust the site with their valuable credit card information, so they end up not buying. But when you’re recommending Amazon products, you never have to worry about this. Most people already have their credit card information on file and have bought things from Amazon in the past. All they have to do is click a button.

2. The Amazon Marketplace is Massive

Amazon sells products in virtually every niche in the world. So, no matter the topic of your site, you can probably find a handful of products on Amazon that are worth promoting. Whether you’re running a site about dog training, aerial drones, stress relief, meditation, kayaking, or any other topic, you’ll be able to find a few highly rated products your readers are likely to buy. This makes it easy to make money as an affiliate, no matter your niche.

3. Revenue From Total Sales

Unlike a lot of other affiliate programs, when someone heads over to Amazon via your affiliate link, you’ll receive a commission for any products they order within a 24-hour period. So, if someone heads over to Amazon to check out a new blender from your affiliate link, but they also end up buying a new dishwasher, diapers, and some supplements, then you’ll receive a commission for those too. This window lasts for 24 hours, even if the person ends up leaving the site and coming back later that day to make a purchase. There’s also an extended 90-day window for any products that are added to the cart but aren’t immediately purchased. So, if someone visits Amazon via your link and adds a product to their cart, they still have 90 days to buy that product, and you’ll still receive a commission.

Why You’ll Want to Use an Amazon Affiliate Plugin

If you have a WordPress site, you can get away without using an Amazon affiliate WordPress plugin, but using one will make your life a lot easier. When you’re just starting and you only include a single affiliate link here and there in your blog posts, you might not see the value in using a WordPress plugin. It’s easy enough to find the product on Amazon and copy and paste your link from your affiliate dashboard. But as you begin recommending more and more products, it can be difficult to decipher which links are making you money and which links are a total waste of time.

By using some of the WordPress plugins we’re going to highlight in the list below, you can better manage your Amazon affiliate links and even generate more revenue. When you first start adding Amazon affiliate links to your WordPress site, it can be thrilling to get that first sale. But as your affiliate income grows, you’ll want measurable data that you can work from. This will help you better optimize your site and your content so that you can increase your affiliate income even further.
Using an Amazon affiliate plugin offers you all kinds of benefits, like:

- Saving time by searching for products within your WordPress editor
- Creating product comparison tables, so visitors can quickly see how products stack up
- Tracking your links to see which products are generating the most revenue
- Automatically directing links to the correct Amazon storefront
- Keeping your product listings up to date with automated updates

5 Best WordPress Plugins for Amazon Affiliates

As you start to search for Amazon affiliate plugins, you’ll notice that there are a ton of different plugins available. Below we highlight five of the best Amazon affiliate plugins for WordPress:

1. EasyAzon

EasyAzon helps you quickly create Amazon affiliate links from within your WordPress dashboard. This will save you a ton of time, since you don’t have to log in to your Amazon Associates account and manually create an affiliate link. You also have the ability to create image affiliate links, product blocks that showcase the features of your products, and call-to-action buttons. If you want even more features, you can upgrade to the premium version of the plugin, which gives you access to features like:

- Link cloaking, so your affiliate links don’t look like affiliate links
- Link localization, so your affiliate links automatically send visitors to the correct store (e.g., amazon.co.uk instead of amazon.com)
- The ability to create and track multiple affiliate IDs, so you can see which links convert the best

2. AAWP (Amazon Affiliate WordPress Plugin)

AAWP is one of the most popular Amazon affiliate WordPress plugins. The goal of this plugin is to help you increase the total value of your Amazon affiliate pages. It achieves this goal by offering you different options to display your products, so they’re more enticing to your readers. For example, you’ll be able to display your recommended products in the following ways:

- Comparison tables that show how different products stack up
- Product boxes that highlight product features and benefits
- Bestseller lists to showcase popular products
- Widgetized sections to add products throughout your site

Plus, all of the product information is updated automatically to reflect the latest pricing and details, because the plugin pulls directly from Amazon.

3. Amazon Link Engine

Amazon Link Engine was created by the team behind the popular service GeniusLink, the same link tracking and management service that’s used by writers like Ryan Holiday and companies like BMW and NBC. The goal of this plugin is to help boost your sales and commissions by localizing all of your links, so whenever someone clicks a link on your site, they’ll be brought to the proper Amazon storefront. This plugin takes care of the heavy lifting for you, and all of your links are localized automatically. All you have to do is install and activate the plugin, then sync your Amazon Associates IDs. This plugin functions differently from other link localization plugins because it looks at more than just the product ID. This ensures that traffic gets sent to the product visitors are most likely to buy. Here’s a quick rundown of its feature set:

- Automatic link localization for all of your Amazon affiliate links
- Revenue maximization, since visitors won’t be sent to blank product pages
- Fast setup and configuration, in just a couple of clicks

4. AmaLinks Pro

AmaLinks Pro is a relatively new WordPress plugin. It was built because the creators felt that the existing plugins weren’t delivering what they needed in a plugin.
So, they built their own plugin that meets all of their needs. It has since been endorsed by sites like Niche Pursuits and Human Proof Designs. The goal of this plugin is to make integrating your WordPress site with Amazon as simple as possible. It’s equipped with features that allow you to:

- Search for products within your WordPress editor, so there’s no need to move back and forth between Amazon and your site
- Quickly insert links into your content with a few clicks
- Insert image links, so that your product pictures link out to Amazon too
- Create a showcase box to highlight all the unique features of the product you’re promoting
- Build comparison tables to help readers compare the differences between products

5. AzonPress

AzonPress is an intuitive all-in-one plugin for integrating WordPress with Amazon. A lot of plugins equipped with a ton of different features can get overwhelming to use, but not this one. Think of it as a combination of all the plugins highlighted above. With this WordPress plugin you can do a lot of things. For example, you can create Amazon affiliate stores that function as if they were an eCommerce store: users browse through the different products you’re highlighting and are sent over to Amazon via an attractive “buy” button. There are a variety of other ways you can showcase your affiliate products as well, like:

- Product comparison tables, so your visitors can compare products quickly
- Responsive product tables that look good on every screen size
- Automated product updates, so you always have the latest product info

It’s also equipped with a built-in affiliate management dashboard, so you can easily see your earnings, view historical affiliate data, track which links are bringing you the most revenue, and a lot more.

By now you should have a better understanding of how the Amazon affiliate program works and how you can best integrate it into your WordPress site. You don’t have to install every plugin from the list above; instead, choose one or two that’ll help you achieve your goals. Ready to optimize your WordPress site for affiliate earnings? Check out these top WordPress themes for affiliate websites.

How to Use a Business Loan to Finance Your Side Hustle

HostGator Blog -

The nice thing about a side hustle is that you can put as much or as little effort into it as you’d like. It can be a robust source of extra income for you, or a fun hobby that generates a little spending money. Let’s say that your side hustle starts to pick up steam, and you’re starting to see it as a legitimate small business play. Maybe it’s time to put more money behind your side hustle operations—to pay for more marketing, improved material quality, or whatever else you need to bring your side hustle to the next level.

Where can you get the funding? It’s tempting to use your personal savings—few options are more convenient than going to the ATM—but if you’re not prepared to make that kind of investment in a side hustle, what are your alternatives? If donations or loans from your personal network aren’t an option, it’s time to explore how you can use business loan products to finance your side hustle.

How to lay the groundwork for a side hustle loan

Getting a loan for your side hustle is going to be difficult if you don’t prepare first. You’ll need to treat your side hustle as a legitimate business first—otherwise, why would a business loan lender treat it any differently? Here’s a quick business loan requirements checklist:

- Write up a business plan: For some lenders, this is a requirement. Regardless, it’s still good to have a written document outlining where your business stands and where you want it to go—especially with an influx of new funding.
- Separate your business and personal finances: If you’re using your personal credit card and/or bank accounts to run your side hustle, you’re asking for a paperwork headache. You’ll also fail to build crucial business credit.
- Prepare financial documents: Get ready to bring any documents that show how you and your business are doing—bank statements, balance sheets, tax returns, A/R and A/P aging—to the table.
- Provide collateral: There are few unsecured business loans out there. Be prepared to offer personal collateral to secure the loan.
- Improve your credit scores: Lenders often look at both your business and personal credit scores, so clean up your credit (or learn about what goes into building it) before you apply.

Different lenders, whether they’re traditional banks or online lenders, will have different loan requirements and expectations. Getting the above things in order, however, is a good place to start for any loan.

Your side hustle loan options

The longer you’ve been “in business” with your side hustle and the better you’ve done, the better your loan options will be. For example, most traditional bank loans won’t be available to any small business unless that business has been in operation for at least a few years. Online lenders more readily deal with newer businesses, but charge higher interest rates over shorter repayment periods as a result. You’ll have to crunch the numbers to see which kind of business loan makes sense for you. That said, here’s a general roundup of the best side hustle loan options that you might qualify for:

SBA Microloans

The Small Business Administration’s loan program is the crown jewel of the small business financing world. While most SBA loans are for well-established businesses, the Microloan program (which delivers loans ranging from $500 to $50,000 for new businesses) is a great first step for turning a side hustle into a bigger business.
Online short-term loans or lines of credit

A variety of online lenders have entered the lending space and are willing to grant short-term loans or revolving lines of credit to eligible businesses. For a major investment, a short-term loan could work well for a side hustle; a line of credit is best if you see yourself with ongoing costs you’ll need to cover, such as taking advantage of seasonal discounts to boost your inventory. Many lenders have time-in-business requirements: for some, it’s as little as three months, but more often it’s at least 12 months.

Equipment financing

If your side hustle needs help purchasing a major piece of equipment, such as a vehicle or a piece of kitchen equipment, equipment financiers can extend the exact amount you’ll need to cover the expense, which you’ll repay plus interest.

Inventory financing

Similar to equipment financing, inventory financing is when you use a loan to cover the exact amount you’ll need for a big inventory purchase. If you need the money to get a great deal now on inventory, this option works well. You won’t need additional collateral for inventory or equipment financing, as the inventory or equipment itself secures the loan.

Personal loans

It’s worth mentioning that you can absolutely use a personal loan to finance a side hustle. Personal loans may be easier to qualify for than business loans (especially if you are just getting your side hustle off the ground), and often come with lower interest rates and no collateral requirements. The total amount you can borrow with a personal loan is lower than with a business loan (many personal loans max out at around $30,000), and you won’t build business credit. But for your purposes, for now, that might work fine.

Business credit cards

Business credit cards are also an excellent financing tool for side hustlers. Some elite cards come with a 0% introductory APR or other benefits that you can use to reinvest in your business. When financing larger purchases that you might not have the cash on hand for right away, a credit card is a good choice that helps you build business credit as well.

How you can use your loan

Every business loan and every lender can have different restrictions or requirements around loans. For example, you can’t use SBA loans to pay off existing debt or to purchase real estate. You are allowed to use them for working capital needs or the purchase of inventory, equipment, and other assets. Some forms of financing are for specific use cases, such as equipment or inventory financing. Some credit cards, on the other hand, allow you to perform balance transfers—which is helpful if your new card has a low or 0% interest rate for a certain amount of time, helping you manage what would otherwise be ballooning business debt.

If you’re looking for a loan that can help you build out your side hustle in every way possible—perhaps you have an eye on investing in software to improve operations, in marketing to gain more leads, or even on hiring an employee or two—then consider a personal loan, a microloan, or even crowdfunding via a platform like Kickstarter. You’ll have more latitude. Just remember, you need to make a business case for your loan—a real reason for taking out the loan, with an expected return on investment—or you’ll struggle to repay it.

The bottom line on side hustle loans

Financing a side hustle isn’t unlike financing any other small business endeavor.
Maybe your ambitions for a side hustle are a little less grand, but if you’re taking out money to finance its growth, that’s still a serious investment. Take this process seriously, and you’re much more likely to see success—and to set yourself up to make this a full-time gig, if that’s what you’re building towards.

How to Automate Personalized Video Messages via Facebook Ads

Social Media Examiner -

Want to create a personal video message for each of your ad prospects? Wondering how to streamline the process? In this article, you’ll discover how to deliver personalized video messages to qualified Facebook leads at scale. Why Create Personalized Video Messages? Sometimes it seems the days of good social media comments are in the past […]

New cPanel Licensing Structure and Tips for Managing Your Licenses

InMotion Hosting Blog -

Making news in the hosting world, cPanel has changed its pricing structure. Now, instead of being able to create unlimited cPanel accounts, users will be billed based on usage. InMotion Hosting is doing everything in its power to make this an easy transition. Read on to find out how you can reduce the amount and cost of your cPanel usage.

- Find out how many cPanels you have
- Review the new pricing guide
- Consolidate your cPanel accounts
- View frequently asked questions

While the new pricing system went into effect on December 3, 2019, we at InMotion understand the impact of sudden changes, and want to allow you time to adjust. Continue reading New cPanel Licensing Structure and Tips for Managing Your Licenses at The Official InMotion Hosting Blog.
