Machine Learning and Linear Regression for Mere Mortals – InApps 2025

Main Contents:

Machine Learning and Linear Regression for Mere Mortals – InApps is an article under the topic Software Development Many of you are most interested in today !! Today, let’s InApps.net learn Machine Learning and Linear Regression for Mere Mortals – InApps in today’s post !

Key Summary

This article, part of the “Machine Learning Is Not Magic” series, introduces machine learning (ML) and linear regression in an accessible way, using familiar tools like Microsoft Excel to explain concepts. It leverages a Stack Overflow salary dataset to demonstrate supervised ML and validate assumptions with linear regression. Key points include:

Machine Learning Overview:
- Definition: Based on Tom M. Mitchell’s definition, ML is a program learning from experience (E) to improve performance (P) on a task (T) without explicit programming.
- Supervised ML: Uses labeled data (e.g., salary dataset with experience, location, and salary) to train algorithms for prediction (e.g., salary for new data).
- Unsupervised ML: Groups similar data (e.g., clustering resumes into roles like architects or developers) without specific predictions.
- Focus: The series emphasizes supervised ML due to its simplicity for beginners.
Linear Regression Explained:
- Context: Uses the Stack Overflow dataset where features (e.g., years of experience) predict a label (salary).
- Simple Linear Regression: Predicts a label using one feature, expressed as y = a + bx, where:
  - y = label (e.g., salary),
  - a = y-intercept (starting salary, ~$103,100),
  - b = slope (salary increase per year, ~$1,800),
  - x = feature (years of experience).
- Example: For 6 years of experience, Predicted Salary = $103,100 + $1,800 * 6 = $113,900.
- Multivariate Linear Regression: Considers multiple features (e.g., experience, location, skills) for more complex predictions (not covered in detail here).
Validation with Excel:
- Tool: Uses Excel’s LINEST function to calculate the slope and y-intercept of the dataset.
- Process: Select two cells, enter {=LINEST(C3:C23,B3:B23)}, and press Shift+Control+Enter to get slope (~$1,840.5) and y-intercept (~$103,118.6), closely matching manual assumptions.
- Purpose: Validates the linear regression model and encourages experimentation with datasets (available on GitHub).
Key Concepts:
- Features and Labels: Features (e.g., experience) are input variables; labels (e.g., salary) are the predicted outcomes.
- Algorithm: Linear regression is a mathematical formula (existing for centuries in finance/accounting) optimized for specific datasets via ML.
- Visualization: A chart illustrates the y-intercept (starting point) and slope (change rate) to predict new values.
Next Steps:
- Future installments will explore linear regression in depth, focusing on techniques to improve prediction accuracy.
InApps Insight:
- Linear regression and supervised ML offer accessible entry points for developers to leverage data-driven predictions using familiar tools like Excel.
- InApps Technology can apply these techniques to build predictive models for clients, such as salary forecasting or customer behavior analysis, integrating ML into scalable software solutions.

Read more about Machine Learning and Linear Regression for Mere Mortals – InApps at Wikipedia

You can find content about Machine Learning and Linear Regression for Mere Mortals – InApps from the Wikipedia website

This piece is the latest in a series, called “Machine Learning Is Not Magic,” covering how to get started in machine learning, using familiar tools such as Excel, Python, Jupyter Notebooks and cloud services from Azure and Amazon Web Services. Check back here each Friday for future installments.

In the previous part of this series, we attempted to solve a problem based on a few assumptions drawn from an existing dataset. We will now validate those assumptions by testing it in Microsoft Excel. It’s also time for me to map the concepts we discussed so far with the official Machine Learning terminology.

Tom M. Mitchell, a renowned American computer scientist and professor at Carnegie Mellon University, is widely quoted for his definition of Machine Learning: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”

Let’s try to translate the above definition. Machine learning is the ability to build logic based on existing data without the need of explicit programming. Based on certain algorithms, machine learning programs can derive their logic without a programmer explicitly writing the code.

ML algorithms can deal with data that has a reasonably well-defined structure. Our Stack Overflow salary dataset has multiple columns such as category, location, experience, and salary. Though it is a smaller dataset, almost all the rows have complete data from which the algorithms can draw inferences.

The approach of learning from existing data that has been sufficiently labeled along with the known outcome is called as supervised machine learning. In the dataset that we are dealing with, the known outcome — the salary — is already populated for almost all the rows. When this data is used to train an algorithm, it can learn from existing relationship between location, experience and salary of an employee. We can then predict the salary for data that may not be available within the dataset.

On the other hand, when ML is used to identify and group new data based on an existing dataset, it is called unsupervised machine learning. For example, when we input many resumes of technology professionals to an algorithm, it may create a cluster that groups individuals with similar profiles into architects, developers, and administrators. In this case, we don’t train an algorithm with structured data. It’s also important to understand that there is no specific prediction (like salary) except the logical grouping of similar data points.

Though there are other forms of ML, these are the most widely used. Based on my personal experience, I recommend starting with supervised ML.

Given the simplicity, the scope of this introductory series is confined to supervised ML.

Within supervised ML, there are multiple techniques that can be applied to an existing dataset to either predict or classify the data. Each technique is called as an algorithm. Like any other computer science algorithm, ML algorithms are based on mathematical and statistical formulae. Interestingly, these equations and formulae existed for over a century with heavy usage in the field of finance and accounting.

When we apply large data sets to these traditional formulae, we end up getting interesting results. Machine learning helps us tweak and optimize the formulae for each unique data set.

So, it is safe to assume that machine learning is a field of computer science that combines data and mathematics to evolve programs that can predict and classify new data. The learning part of machine learning implies tweaking the formula to create assumptions that are very specific to the input dataset. As we have seen earlier, the assumptions vary with each dataset.

Let’s get back to the use case that we started with. In the Stack Overflow dataset, we have multiple columns that influence one key value — salary. In the official terminology of ML, these columns are called features. The value (salary) that we ultimately end up predicting is called the label.

Deriving Features and Labels from Stack Overflow Dataset.

In the previous lesson, we derived a few assumptions from the existing data set to guestimate the salary:

The assumption was based on a simple fact that the starting salary is $103,100 and the increase with each additional year of experience is roughly $1,800. We then went about applying this assumption to predict the salary of an individual with six years of experience.

Predicted Salary = Starting Salary + 1800 * Years

So, when we wanted to find the salary for six years of experience, the calculation would be:

$113900 = $103100 + $1800 * 6

We can now go about finding the salary for any number of years of experience, and we would get decent results. You can test this assumption with different values and checking the output.

This formula has no clue whether you are finding the salary or house prices or the rainfall. It would just work with any combination of feature and label. We can safely replace the headers, experience, and salary with x and y.

Let’s call the starting salary as a and increase in salary as b to derive a formula that looks like y = a + bx

Congratulations! You learned your first ML algorithm. The above formula is officially known as linear regression. When we use just one feature (years of experience) to find the value of a label (salary), it is called simple linear regression. Remember that other features such as skill level, location, and position also have an influence on the label (salary). When we consider all of them to predict the label, we use an algorithm called multivariate linear regression. For now, let’s stay focused on understanding simple linear regression with just one feature.

Let’s plot this data in a simple chart to visualize it better.

Technically, the starting point of y in the above chart is called y-intercept. The change in y with each increment in x is called the slope. According to linear regression, a new value of y can be found by multiplying x with the slope and adding it to the y-intercept.

It’s time for us to take our algorithm to the next level through a simple validation.

Do you know that Microsoft Excel does a linear regression? The formula for that existed in Excel for over a decade. Let’s try to implement our new algorithm in the most familiar tool that we use often.

In the above spreadsheet, x represents experience, and y represents the salary. Using an Excel formula called LINEST, we can find the y-intercept and slope of the dataset. LINEST is a function that returns an array, which corresponds to the y-intercept and slope values. We select two cells, type the formula {=LINEST(C3: C23,B3:B23)}, and hit Shift+Control+Enter to populate the selected cells with values.

Based on the first 20 rows, Excel reported 1840.5 as slope and 103118.6 as the y-intercept. Interestingly, these values are not very different from what we found from our assumptions.

With slope (b) and y-intercept (a) in place, we can go ahead and calculate the salary for a given number of years in experience.

I would recommend you to experiment with LINEST in Excel with your dataset. This will validate and clarify your understanding of linear regression. You can access the spreadsheet from GitHub.

In the next part of this tutorial, we will take a closer look at Linear Regression and learn about the techniques to make the predictions more accurate. Stay tuned!

Feature image via Pixabay.

Source: InApps.net

Rate this post

Phu Nguyen

As a Senior Tech Enthusiast, I bring a decade of experience to the realm of tech writing, blending deep industry knowledge with a passion for storytelling. With expertise in software development to emerging tech trends like AI and IoT—my articles not only inform but also inspire. My journey in tech writing has been marked by a commitment to accuracy, clarity, and engaging storytelling, making me a trusted voice in the tech community.