It’s been said that machine learning [ML] is what will finally make sense of the so-called big data. Now ML has its own programming language.
Earlier this year, BigML released WhizzML, a domain-specific language [DSL] designed specifically for building automatic machine learning workflows. It’s a complete language, including variables, data structures, conditionals, and mathematical functions, that comes with native calls to create data sets.
And WhizzML, like all good machine learning resources, is API-first, making it “composable,” allowing for better, faster automation, said Poul Petersen, CIO of BigML.
“The vision is to make it easy to go from data to insights,” Petersen said. WhizzML could provide a way over the final hurdles to machine learning automation: it uses workflow automation and automatic model selection, lets users write and share workflows, and makes it possible to package a workflow’s complexity into reusable executions.
WhizzML also leaves room for algorithm options, letting the user test different ones and move ahead with whichever gives the most accurate result. It also automates the process of perfecting the data, identifying where data is missing and which combination works best for the desired outcome. This sort of model tuning can help with feature selection, recognizing which features to use and which to avoid because they only add noise.
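To make the idea of automatic model selection concrete, here is a minimal sketch in plain Python. It is not WhizzML and does not use the BigML API; the candidate "models," the holdout data and the function names are all hypothetical stand-ins. It simply scores each candidate on held-out data and keeps the one with the lowest error.

```python
from statistics import mean


def mean_abs_error(predict, rows):
    """Average absolute gap between predictions and known targets."""
    return mean(abs(predict(x) - y) for x, y in rows)


def select_best(candidates, holdout):
    """Return (name, error) for the candidate with the lowest holdout error."""
    scores = {name: mean_abs_error(fn, holdout) for name, fn in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical holdout data: (input, target) pairs that follow y = 2x + 1.
holdout = [(1, 3), (2, 5), (3, 7), (4, 9)]

# Hypothetical candidates standing in for different algorithms or settings.
candidates = {
    "constant": lambda x: 6,
    "linear": lambda x: 2 * x + 1,
    "quadratic": lambda x: x * x,
}

best_name, best_error = select_best(candidates, holdout)
print(best_name, best_error)
```

A real workflow would train each candidate on one split of the data and score it on another; the selection step itself stays this simple.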
Automated Learning
In a recent webinar on the topic, Petersen talked about the discoveries that can come from automated insights, including learning:
- How to reduce churn.
- How to increase conversion.
- How to improve medical diagnoses.
- How to reduce fraud.
But while many big businesses are hiring academics to make sense of machine learning and the next level of deep learning, that doesn’t help the 98 percent of small businesses in the U.S. that can’t afford an in-house data scientist. That’s why machine learning automation is the hottest buzzword.
When machine learning is only partly automated, Petersen points to several questions that arise:
- Which of the hundreds of algorithms should you use?
- How can your machine learning scale with the data?
- How do you deal with real data? Missing data? Mixed data types?
- How do you tune your algorithm for best performance? Use parameters?
- How can you automate those decisions?
Petersen pointed out that the last question, how to automate those decisions, is the crucial part of automating machine learning because it is what enables real-time decision making. Essentially, automation allows the process to be fast and frequent.
WhizzML was built with automation in mind and designed around hand-selected algorithms that handle real data and scale with it, the needs mentioned above, in an understandable way. Petersen says that once this is in place, machine learning can be tuned for accuracy.
“The dirty secret of machine learning is that the largest improvements in accuracy are more often from feature engineering and model tuning rather than selecting different algorithms,” he said.
Automated Workflows in Machine Learning
In the context of machine learning, a workflow is a set of specific tasks that are linked together in order and then automated. Machine learning is by nature iterative, and many machine learning tools require repetitive and even manual tasks. With an automated workflow, you take those outside variables out of the process, which makes the result more valuable.
Petersen explained that “Without a workflow, output becomes a secondary objective with too much focus on infrastructure.” He went on to point out that “Not everybody can implement complex workflows, but many can reuse them.”
Below is an example of a typical real estate big data workflow, built to predict the sale price of a home because, while “location, location, location” gets all the attention, pricing matters most of all.
This machine learning model is used to look for deals among all the houses currently for sale; on the output side, it reports the difference between the predicted price and the list price. The workflow is fine for a smaller company, but it doesn’t scale to other cities for a broader, more accurate view, because you have to deal manually with different sources, feature engineering and post-processing of the data. That leaves a lot of room for human error and inexact output.
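The output step of such a workflow, comparing a predicted price with a list price, can be sketched in a few lines of Python. The listings, prices and the `min_gap` threshold below are invented for illustration, and the predicted prices are assumed to come from a model trained elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    address: str
    list_price: float
    predicted_price: float  # assumed to come from an already-trained model

    @property
    def gap(self) -> float:
        """How far the predicted price sits above the list price."""
        return self.predicted_price - self.list_price


def find_deals(listings, min_gap=0.0):
    """Keep listings priced below the prediction by more than min_gap, biggest gap first."""
    deals = [item for item in listings if item.gap > min_gap]
    return sorted(deals, key=lambda item: item.gap, reverse=True)


# Hypothetical listings, for illustration only.
listings = [
    Listing("12 Oak St", 300_000, 340_000),
    Listing("8 Elm Ave", 450_000, 430_000),
    Listing("5 Pine Rd", 250_000, 265_000),
]

deals = find_deals(listings, min_gap=10_000)
for deal in deals:
    print(deal.address, deal.gap)
```

The manual parts described above, gathering sources and engineering features, are exactly what this sketch leaves out, and they are the parts that are hard to repeat by hand for every new city.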
“Now these higher level tasks like modeling and tuning can be automated as well, making it possible to completely remove the need for human interaction in the path going from data to insights,” Petersen said.
Below you can see an example Workflow Map of WhizzML.
Behind the code of the WhizzML API
The webinar walks through the code that goes into using WhizzML, which then runs the machine learning through the API. Here’s an overview of the WhizzML API resource types:
- Script: the written expression of a workflow, defining the inputs and outputs required to run it.
- Library: WhizzML code that can be included in other scripts, allowing you to extract reusable ideas without duplicating them.
- Execution: specifies a script and a given set of inputs, then runs the script to produce a set of outputs. This lets you see the history of your machine learning work, including which scripts, inputs and outputs were used.
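How the three resource types relate can be sketched as a small in-memory model in Python. This is only an illustration of the concepts: the class names, fields and the `execute` helper are hypothetical and are not the BigML API or its Python bindings.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Library:
    """Reusable helpers that several scripts can share."""
    functions: Dict[str, Callable[..., Any]]


@dataclass
class Script:
    """A workflow plus the names of the inputs it requires."""
    name: str
    inputs: List[str]
    body: Callable[[Dict[str, Any], Library], Dict[str, Any]]


@dataclass
class Execution:
    """One run of a script: which inputs produced which outputs."""
    script: str
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]


history: List[Execution] = []


def execute(script: Script, inputs: Dict[str, Any], library: Library) -> Execution:
    """Check the required inputs, run the script, and record the run in history."""
    missing = [name for name in script.inputs if name not in inputs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    outputs = script.body(inputs, library)
    run = Execution(script.name, dict(inputs), outputs)
    history.append(run)
    return run


# Hypothetical library and script, for illustration only.
stats = Library({"mean": lambda xs: sum(xs) / len(xs)})
average_price = Script(
    name="average-price",
    inputs=["prices"],
    body=lambda inputs, lib: {"average": lib.functions["mean"](inputs["prices"])},
)

run = execute(average_price, {"prices": [100, 200, 300]}, stats)
print(run.outputs)
```

Keeping each run as its own record is what makes the history inspectable later: every entry names the script used, the inputs it received and the outputs it produced.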
As part of the WhizzML release, the company has already updated the Python, Node.js, Swift and Objective-C bindings. The release also includes a gallery of other users’ scripts, so you don’t have to create your own right away, and scripts can be imported straight from GitHub repositories. Some gallery scripts are free while others are priced by their creators, but all are reviewed by the BigML team.
Since machine learning automation is a touch more art than pure science, WhizzML has an interactive shell, or REPL (read-eval-print loop), for debugging and experimenting with the programming language, such as trying out syntax and code changes. The BigML team is working to open source the REPL code to make it easier for you to play around with it from within your own environment.