Building a Lakehouse with Databricks and Machine Learning
When it comes to data for machine learning (ML) applications, a database system often just doesn’t cut it. You need something bigger, like a data warehouse or data lake. There’s also an emerging class of specialist AI and big data platforms that pitch themselves as something in between a development platform and a data warehouse.
One such company is Databricks, which bills itself as a “unified platform for data and AI.” It offers large-scale data processing, analytics, data science and other services.
Richard MacManus
Richard is senior editor at InApps and writes a weekly column about what’s next on the cloud native internet. Previously he founded ReadWriteWeb in 2003 and built it into one of the world’s most influential technology news and analysis sites.
To find out more about Databricks’ strategy in the age of AI, I spoke with Clemens Mewald, the company’s director of product management, data science and machine learning. Mewald has an especially interesting background when it comes to AI data, having worked for four years on the Google Brain team building ML infrastructure for Google.
I started by asking Mewald how Databricks relates to modern database systems, such as Apache Cassandra and MongoDB.
He replied that Databricks is “database agnostic.” The company specializes in large-scale data processing, he said, but the real key to its approach is the data lake theory.
A data lake is a repository of raw data stored in a variety of formats — anything from unstructured data like emails and PDFs, to structured data from a relational database. The term was coined in 2011, as a modern variation of the late-1980s concept of a data warehouse. A key difference: data lakes were designed to deal with the internet and its masses of unstructured data.
In a blog post from January, Databricks extended the data lake idea by coining a new term: the lakehouse. It was described as “a new paradigm that combines the best elements of data lakes and data warehouses.”
It should be noted that, unlike data warehouses, the data lake concept has not been universally accepted in the industry. Business Intelligence analyst Barry Devlin wrote in response to the Databricks post that “while often claimed to be an architecture, the data lake has never really matured beyond a marketing concept.” He wonders, “can the lakehouse do better?”
While “the lakehouse” might be contentious, Databricks does at least have a product that actually implements the theory: Delta Lake. It aims to ensure the reliability of data across data lakes at a massive scale; the technology was open sourced last April.
“A couple of years ago we built a product called Delta Lake,” Mewald told me, describing it as “both a storage format and a transaction layer.”
“It basically gives you similar capabilities of a data warehouse, on top of a data lake,” he continued, “and that’s why the way to think about Databricks is, we are database agnostic; you can ingest data into Databricks and into a delta lake, from any data source. So, let’s say from Cassandra or MongoDB. And then we provide you with this optimized format, an optimized query engine, and transactional guarantees for querying that data for all kinds of use cases and applications.”
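To picture what a “transaction layer” on top of plain files can mean, here is a toy sketch in Python. It is not Delta Lake’s API or file layout; the `ToyTable` class, the `_log` directory and the file names are all hypothetical, chosen only to illustrate one common approach: data files become visible to readers only once an ordered log entry points at them, and any earlier version can still be read back.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """Hypothetical toy table: data files plus an ordered commit log."""

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _commits(self):
        # Zero-padded names make lexical order equal to commit order.
        return sorted(self.log_dir.glob("*.json"))

    def latest_version(self):
        return len(self._commits()) - 1  # -1 means the table is empty

    def commit(self, rows):
        """Write a data file first, then publish it with a log entry."""
        version = self.latest_version() + 1
        data_file = self.root / f"part-{version:05d}.json"
        data_file.write_text(json.dumps(rows))
        entry = self.log_dir / f"{version:05d}.json"
        # Mode "x" fails if the entry already exists, so two writers
        # cannot both claim the same version number.
        with open(entry, "x") as f:
            json.dump({"add": data_file.name}, f)
        return version

    def read(self, version=None):
        """Return the rows visible at a version (default: latest)."""
        commits = self._commits()
        if version is not None:
            commits = commits[: version + 1]
        rows = []
        for entry in commits:
            name = json.loads(entry.read_text())["add"]
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


with tempfile.TemporaryDirectory() as tmp:
    table = ToyTable(tmp)
    # Rows standing in for data ingested from two different sources.
    table.commit([{"id": 1, "source": "cassandra"}])
    table.commit([{"id": 2, "source": "mongodb"}])
    snapshot_v0 = table.read(version=0)
    snapshot_latest = table.read()
```

In this sketch a reader never sees a half-written data file, because the file only counts once its log entry exists, and reading at `version=0` returns the table as it stood after the first commit. A production system has to handle much more (concurrent writers on object storage, deletes, schema changes), which this toy deliberately ignores.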
Machine learning is another key part of Databricks’ offering. The company claims that it “streamlines ML development, from data preparation to model training and deployment, at scale.” MLflow is an open source framework that Databricks released to help with this. Databricks provides a managed version of MLflow in its platform (Janakiram MSV profiled MLflow last year for InApps, and also wrote a tutorial for it).
I was curious about Mewald’s background at Google, which is known as a pioneer in applying ML to consumer apps – like Gmail, ad personalization, Google Assistant, and YouTube video recommendations. What did he learn there about how ML is being used in modern applications?
Mewald replied that he got to “see any and all applications of machine learning” while working at Google. However, he thinks other companies have now caught up to Google in terms of applying ML — including, not surprisingly, his current employer.
“What I find really exciting about Databricks is that I actually now see the exact same diversity of use cases with Databricks customers. It’s actually a myth that a company like Google is way, way, way ahead in terms of ML applications.”
The developer experience, though, is only getting more complicated, thanks to distributed computing, Kubernetes, DevOps and other currently popular cloud native technologies. Adding machine learning to a developer’s plate only increases the complexity they have to deal with. So I asked Mewald what advice he has for developers when it comes to integrating ML into their apps.
He first noted that “machine learning really is a paradigm shift in how we think about developing.”
“In software,” he continued, “you write code, you write a unit test, and it behaves the same way every time you run it. In machine learning, you write code and there’s this data dependency; and every time you train your machine learning model, it will behave differently because it’s inherently stochastic and the data changes. [So] it’s not as deterministic.”
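Mewald’s contrast can be made concrete with a minimal Python sketch. Everything in it is an illustrative assumption rather than anything Databricks ships: a tiny made-up dataset, a one-weight model, and a `train` function that uses stochastic gradient descent with a random starting weight and random shuffling.

```python
import random


def make_data(n=20, seed=42):
    """Fixed toy dataset: y is roughly 2 * x plus a little noise."""
    rng = random.Random(seed)
    xs = [i / n for i in range(1, n + 1)]
    return [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in xs]


def add(a, b):
    """Ordinary code: the same inputs give the same output on every run."""
    return a + b


def train(data, seed, epochs=3, lr=0.05):
    """Fit y = w * x by stochastic gradient descent.

    The seed controls the random starting weight and the shuffle order,
    the two sources of randomness in this sketch.
    """
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            grad = 2.0 * (w * x - y) * x  # gradient of the squared error
            w -= lr * grad
    return w


data = make_data()
# Same code, same data, different seeds: one weight per training run.
weights = [train(data, seed) for seed in range(5)]
print(weights)
```

A unit test for `add` can assert an exact value. A test for `train` can only pin the result down by fixing the seed, or by checking that the learned weight falls within a tolerance, and if the data changes between runs even a fixed seed no longer guarantees the same model.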
The problem, Mewald said, is that a lot of developers are using older software engineering tools — some of them created “decades ago” — for ML. So he advises developers tackling ML today to choose “modern developer tools” such as MLflow.
My final question for Mewald was a speculative one. It still seems very early for machine learning, particularly from an application perspective, so what does he think the key challenges will be over the next few years as ML matures?
“Machine learning is where data engineering was 10 years ago,” he replied. “Like, ten years ago if you asked someone to write a program to crunch through terabytes of data, it was a big deal — there were just a handful of people on the planet who could do that.”
Today, though, the same task can be done using a tool like Databricks. Or as Mewald put it, you input “a Spark SQL query and it just magically works.”
But ML is still at that awkward stage where it involves a lot of manual work and requires specialist knowledge.
“In most cases, when we build machine learning models today it’s a one-off,” he explained. “It’s this like stitched together thing, and maybe it works and they can just get it over the line and then you’re done — but it’s not maintainable and not repeatable.”
So, much like the transition data engineering went through, ML will have to become much more accessible for more people. To achieve that, the tools need to become easier to use. Maybe to the point, Mewald added, where “anyone who can write a SQL query can do machine learning.”
Perhaps by then, the lakehouse concept will have been proven out too — but time will tell whether the industry adopts it.
Source: InApps.net