Amazon Web Services released SageMaker Studio at re:Invent 2019. As a fully integrated development environment for machine learning, SageMaker Studio brings together all the development tools and assets users need in one place. It promises great value for established ML practitioners, as well as for those less experienced in the domain. The main purpose of Studio is to speed up and simplify the entire ML workflow.
What’s important to note is that all tools inside SageMaker Studio are easy to use from the console; they need to be quick and understandable, especially for users who are not developers. Why is this important?
Business owners often face a dilemma when they need to hire data scientists or commit their own company to data science projects. They have to assess whether they are good candidates for such projects: would it be beneficial to proceed, and what are the possible positive or negative side effects? Sometimes they just need a better understanding of the whole process to be able to mitigate the possible risks. While all these questions could be answered by data scientists assessing the situation and analyzing available data, it is useful to have an accessible toolkit that showcases the potential and suitability of a data science project. Some of the tools available in SageMaker Studio are intuitive even for inexperienced users.
This blog will show the tools available in SageMaker Studio and deep-dive into SageMaker AutoPilot, a tool that solves or speeds up the biggest part of any data science project – data preparation and feature engineering.
First, let’s look at the general features of SageMaker Studio Tools:
Some of the tools available in SageMaker Studio are new, while some are improvements of existing tools.
SageMaker Notebooks look and feel like Jupyter notebooks. They are among the first choices of data scientists when beginning a project, since they offer a way to track code and notes in the same place. However, AWS provides much more – Studio delivers single-click notebooks for the SageMaker environment. The underlying compute resources are fully elastic and integrated with other AWS services, so it’s easy to add or remove resources without interrupting work. AWS has had SageMaker Notebooks in their catalog since 2018, but with the release of SageMaker Studio, usage has been simplified and collaboration has been improved, as multiple users can interact with the notebooks simultaneously.
SageMaker Debugger is made to simplify the debugging process, and this feature is expected to improve over time. AWS provides in-depth explanations, including code snippets showing how the tool can help developers debug TensorFlow models. This feature is probably the most important step toward automating ML processes. The hardest parts of any ML process are preparation and debugging, so this feature is quite welcome to data scientists, whether they are debugging fully trained or new models.
SageMaker Experiments are collections of trials – whether we’re referring to a group of different methods that are applied to the same problem or other similar training jobs. This is handy since you often have no way of knowing how long a job will continue to run, or if it has silently crashed in the background. The Experiments feature should be a useful addition for cloud-based jobs, large data sets, or GPU-intensive projects.
SageMaker Model Monitor is one of the most important features in the post-deployment phase. Even after a model is put into production, the work is not completely done. Model monitoring is necessary because a model’s performance may worsen over time due to changing trends in the input data or unexpected problems. Model Monitor alerts model maintainers about input data drift, and consequently tracks the model’s performance. A system that can flag these changes automatically is valuable because it mitigates the consequences of data change and lowers the long-run cost of post-deployment maintenance, freeing data scientists on the team to focus on other important work. Considering all this, Model Monitor presents a clear benefit of standardizing model hosting on SageMaker Endpoints.
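The core idea behind input-drift detection can be illustrated without any AWS services. The sketch below is a hedged, minimal illustration of what Model Monitor automates at scale: compare simple statistics of live traffic against a training-time baseline and raise an alert when they diverge. The feature values and the z-score threshold are illustrative, not part of the Model Monitor API.

```python
# Minimal drift check: flag when the mean of live data strays too many
# baseline standard deviations from the baseline mean.
from statistics import mean, stdev

def drifted(baseline, live, z_threshold=3.0):
    """Return True if the live mean deviates from the baseline mean by
    more than z_threshold baseline standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(live) != mu
    return abs(mean(live) - mu) / sigma > z_threshold

# Illustrative feature (e.g. customer age) at training time vs. in production.
baseline_ages = [30, 32, 31, 29, 30, 33, 31, 30]
live_ages_ok = [31, 30, 32, 29]          # similar population
live_ages_drift = [55, 60, 58, 62]       # population shifted upward

drifted(baseline_ages, live_ages_ok)     # no alert expected
drifted(baseline_ages, live_ages_drift)  # alert expected
```

In production, Model Monitor computes far richer baseline statistics from the training data, but the alert-on-deviation principle is the same.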
SageMaker AutoPilot belongs to the AutoML category: it automatically trains ML models from CSV data files. AutoPilot is a great step forward in AWS’s attempt to accelerate the most tedious part of ML development – the exploratory analysis and preparation phase. Approximately 70%–90% of the work on ML projects is dedicated to cleaning and preparing data so that ML algorithms can use it optimally. Every tool that lowers that percentage reduces costs and shortens time to a solution, which again gives data scientists more time to concentrate on other work.
At the time of release, AWS SageMaker Studio was available only in the US East (Ohio) region, but it is now also available in US East (N. Virginia), US West (Oregon), and Europe (Ireland), and in time we expect its availability to spread to even more regions.
Here we will show how to start with Studio from the console and provide some useful links for official, more detailed documentation. An AWS SSO or IAM account is required to sign in to SageMaker Studio – starting through the console is the easiest way.
This is the SageMaker Studio GUI after opening the Studio.
Here it’s possible to click one button to open a new notebook and start a model deployment or an AutoPilot experiment.
To proceed with the AutoPilot experiment, the user needs to enter a name, the S3 bucket where the data in CSV format is located, and where the results will be stored. The user also needs to define the target column they want to classify or predict. It’s important to choose between different types of ML problems; for now the choice is between classification and regression, but in time AWS will add more options. If unsure, the user can leave the Auto option selected and SageMaker Studio should recognize the type of problem on its own. By choosing whether to run a complete experiment, the user decides between receiving an optimized model after completion or just a general idea of which models are best to try out, given the data. The second option is shorter in terms of execution but doesn’t provide a tuned algorithm.
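The same setup can be driven programmatically instead of through the Studio UI. The sketch below assembles the request that boto3’s SageMaker client expects for `create_auto_ml_job`; the bucket paths, job name, target column, and role ARN are placeholders, and the mapping of UI choices to request fields is an assumption based on the options described above.

```python
# Sketch: build a create_auto_ml_job request mirroring the Studio form.
# All names, paths, and ARNs below are illustrative placeholders.

def build_autopilot_request(job_name, input_s3_uri, output_s3_uri,
                            target_column, role_arn,
                            problem_type=None, complete_experiment=True):
    """Assemble the request dict for sagemaker.create_auto_ml_job.

    problem_type=None corresponds to the "Auto" option in the UI, letting
    AutoPilot infer classification vs. regression. complete_experiment=False
    maps to generating candidate definitions only, skipping full tuning.
    """
    request = {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_s3_uri,
            }},
            "TargetAttributeName": target_column,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "RoleArn": role_arn,
        # True = stop after suggesting candidates, without the tuning phase.
        "GenerateCandidateDefinitionsOnly": not complete_experiment,
    }
    if problem_type:  # omit entirely to let AutoPilot decide ("Auto")
        request["ProblemType"] = problem_type
    return request

request = build_autopilot_request(
    job_name="customer-targeting",
    input_s3_uri="s3://my-bucket/data/train.csv",
    output_s3_uri="s3://my-bucket/autopilot-output/",
    target_column="purchased",
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",
    problem_type="BinaryClassification",
)
# To launch: boto3.client("sagemaker").create_auto_ml_job(**request)
```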
After the setup, AutoPilot starts data analysis, feature engineering, and, if a complete experiment was chosen in the previous step, model tuning. Data analysis and feature engineering are the most time-consuming steps in the ML workflow, so having those two handled automatically allows data scientists to spend more time on more complex steps. For now, it’s only necessary to have the data ready in CSV format, with the recommendation that it contain some degree of structure – i.e. if data is categorical, having different spellings for the same thing is not optimal. “red shirt”, “shirt red”, and “red shirts” will be recognized as different categories by AutoPilot even though they represent the same category, due to the non-standardized naming.
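The “red shirt” / “shirt red” / “red shirts” problem can often be reduced with a simple normalization pass before uploading the CSV. This is a minimal sketch under simplifying assumptions (lowercasing, a crude trailing-“s” singularization, and token sorting so word order stops mattering); real datasets usually need a more careful canonicalization step.

```python
# Collapse spelling/order variants of the same category into one label.

def normalize_category(value: str) -> str:
    tokens = value.lower().split()
    # crude singularization: drop a trailing "s" from longer tokens
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    # sort tokens so "shirt red" and "red shirt" become identical
    return " ".join(sorted(tokens))

variants = ["red shirt", "shirt red", "red shirts"]
canonical = {normalize_category(v) for v in variants}
# all three variants collapse into a single category
```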
After the first step of data analysis is finished, the user gets two generated notebooks: a Data Exploration notebook and a candidate generation notebook.
The Data Exploration notebook provides insights about the dataset (in our example we use customerTargeting). The notebook is informative and gives ideas on what should be done to optimize the future ML workflow and data usage, ranging from the simple “what’s the percentage of missing values” to the more complex “some data points appear to be outliers”.
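To make the missing-values check concrete, here is a hedged, stdlib-only illustration of the kind of statistic the Data Exploration notebook reports. The rows mimic what `csv.DictReader` would produce from a CSV file; the column names and values are invented for the example.

```python
# Compute the percentage of missing (empty) values per column.

def missing_percentages(rows):
    """rows: list of dicts, one per CSV row (as from csv.DictReader)."""
    if not rows:
        return {}
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, value in row.items():
            if value is None or value.strip() == "":
                counts[col] += 1
    total = len(rows)
    return {col: 100.0 * n / total for col, n in counts.items()}

# Illustrative rows, as if read from a customer-targeting CSV.
rows = [
    {"age": "34", "income": "52000"},
    {"age": "",   "income": "61000"},
    {"age": "29", "income": ""},
    {"age": "41", "income": "48000"},
]
stats = missing_percentages(rows)
# one empty cell out of four rows in each column -> 25% missing each
```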
The second notebook created for the user by AutoPilot is “SageMaker Autopilot Candidate Definition”. This notebook provides info on the suggested workflow together with the code definition and descriptions, which is helpful if the user doesn’t have much experience with the suggested models.
At the model tuning step, AutoPilot tries out multiple models and hyperparameters in order to get the best score on different metrics. This step takes more time, but again, the user isn’t required to know anything about the code that runs in the background. Once it’s finished, it’s possible to click on “Deploy model”.
To conclude, AWS SageMaker Studio provides features that guide the user end to end through the ML workflow in just a few clicks, without requiring much coding knowledge. This approach can be a great starting point for familiarizing the user with the data and the modeling options. It’s a great asset for business users who want to understand the main principles, and for data science teams starting a project who want to speed up the first phases in order to get to innovative solutions quickly.
To see more about the topics mentioned in this blog post, stay tuned for Part 2 of this series, where we will deep-dive into more of SageMaker Studio’s features and walk through some real use cases.
For more details on AWS SageMaker Studio please review the official documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html