Data Science and Linear Algebra
Anyone who wants to grow as a Data Scientist should know maths. But before getting to the maths, consider where Data Science shows up in everyday applications, such as:
- Recommending a movie for you to watch on Netflix, a product on an e-commerce store, etc.
- Forecasting company sales and profits
- Automating an IT helpdesk by analysing the issues raised
- Suggesting a song to add to your playlist
So how does Mathematics fit into this?
Mathematics is used from the very beginning, from data collection and analysis through to the final stages of interpretation and presentation.
Here I discuss one of the branches of mathematics used in Data Science:
LINEAR ALGEBRA
Linear algebra is behind all the powerful machine learning algorithms we are so familiar with. It is everywhere. It can open doors to understanding and manipulating data in ways you may never have thought of before. It also lets you code and model from scratch. Isn't that why we have such immense interest in this field and enjoy doing it, the potential to experiment and play around with our models? Consider linear algebra as the key that unlocks a magical world.
We are all quite familiar with regression models, say Linear Regression.
We start with a linear function, apply it to the independent features in our dataset, and predict an output for the target feature. But wait, how do you measure the difference between the actual and predicted values? With a loss function, and a loss function is simply a vector operation. Two common distance measures appear again and again:
- Manhattan Distance: The distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|
- Euclidean Distance: The length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem.
To calculate MSE, you take the difference between your predictions and the ground truth, square it, and average it out across the whole dataset. RMSE and the other regression metrics are built in the same way; linear algebra is behind all of them.
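Here is a minimal NumPy sketch (with made-up numbers) showing that the distances above and MSE/RMSE are nothing more than vector operations:

```python
import numpy as np

# Two points in the plane
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

manhattan = np.sum(np.abs(p1 - p2))          # |x1 - x2| + |y1 - y2| = 3 + 4 = 7
euclidean = np.sqrt(np.sum((p1 - p2) ** 2))  # Pythagorean theorem: sqrt(9 + 16) = 5

# Same idea applied to a whole vector of predictions vs. ground truth
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error

print(manhattan, euclidean, mse, rmse)
```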
Regularisation:
This is another example of the application of linear algebra. It is a method through which we prevent our model from overfitting. Overfitting happens when the model fits the training data very well, but when it is applied to a new dataset it produces errors because it has also learned the noise in the training data. Regularization penalizes overly complex models by adding the norm of the weight vector to the cost function.
There are two common ways of doing this: Lasso Regression and Ridge Regression.
A simple relation for linear regression looks like this, where Y represents the learned relation and each β represents the coefficient of a feature:
Y ≈ β0 + β1X1 + β2X2 + … + βpXp
The coefficients are chosen in such a way that they minimize a loss function, the residual sum of squares (RSS) between the predicted and actual values.
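Fitting those coefficients is itself a linear algebra problem. As a rough sketch on synthetic data, the least-squares solution can be obtained directly from the normal equation β = (XᵀX)⁻¹Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3*x1 - 1*x2 + noise
n = 100
X = rng.normal(size=(n, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Add a column of ones so beta0 (the intercept) is learned too
X_design = np.column_stack([np.ones(n), X])

# Normal equation: beta = (X^T X)^(-1) X^T y, solved as a linear system
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta)  # approximately [2, 3, -1]
```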
Now, this will adjust the coefficients based on your training data. If there is noise in the training data, then the estimated coefficients won’t generalize well to the future data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero.
Ridge Regression
In ridge regression the RSS is modified by adding a shrinkage quantity, λ(β1² + β2² + … + βp²), and the coefficients are estimated by minimizing this combined function. Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. The increase in flexibility of a model is represented by an increase in its coefficients, and if we want to minimize the above function, these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, except the intercept β0. This intercept is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.
When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be equal to least squares. However, as λ→∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. As can be seen, selecting a good value of λ is critical, and cross-validation comes in handy for this purpose. The penalty used here is the L2 norm of the coefficient vector, which is why ridge is also known as L2 regularization.
The coefficients that are produced by the standard least squares method are scale equivariant, i.e. if we multiply each input by c then the corresponding coefficients are scaled by a factor of 1/c. Therefore, regardless of how the predictor is scaled, the product of predictor and coefficient (Xjβj) remains the same. However, this is not the case with ridge regression, and therefore we need to standardize the predictors, i.e. bring them to the same scale, before performing ridge regression. In practice this means dividing each predictor by its standard deviation so that all predictors have unit variance.
Lasso
Lasso is another variation, in which the RSS plus a shrinkage quantity λ(|β1| + |β2| + … + |βp|) is minimized. It's clear that this variation differs from ridge regression only in how it penalizes the high coefficients: it uses |βj| (the modulus) instead of the squares of β as its penalty. In statistics, this is known as the L1 norm.
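To make this concrete, here is a hedged sketch using scikit-learn on synthetic data; the StandardScaler step performs the standardization mentioned above, and the alpha argument plays the role of λ:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Ridge: L2 penalty (lambda * sum of squared coefficients)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# Lasso: L1 penalty (lambda * sum of |coefficients|)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)

print(ridge[-1].coef_)  # all coefficients shrunk towards zero
print(lasso[-1].coef_)  # uninformative coefficients typically driven to exactly 0
```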
Let's take a look at the above methods from a different perspective. Ridge regression can be thought of as solving an equation where the summation of squares of the coefficients is less than or equal to s, and Lasso can be thought of as an equation where the summation of the modulus of the coefficients is less than or equal to s. Here, s is a constant that exists for each value of the shrinkage factor λ. These equations are also referred to as constraint functions.
Consider there are 2 parameters in a given problem. Then, according to the above formulation, ridge regression is expressed by β1² + β2² ≤ s. This implies that ridge regression coefficients have the smallest RSS (loss function) for all points that lie within the circle given by β1² + β2² ≤ s.
Similarly, for lasso, the equation becomes |β1| + |β2| ≤ s. This implies that lasso coefficients have the smallest RSS (loss function) for all points that lie within the diamond given by |β1| + |β2| ≤ s.
Covariance Matrix:
When we are performing EDA and visualising our dataset, we want to know the relationship between two variables. Covariance is the measure used to study that relationship. You might think this is statistics, so where does linear algebra come in? As I said before, linear algebra is a key that unlocks everything: using the transpose and matrix multiplication, we can compute the covariance matrix directly from the data matrix.
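A minimal NumPy sketch of exactly that idea, on random data: center the data matrix, then one transpose and one matrix multiplication give the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))  # 100 observations, 3 variables

# Center each column, then use transpose and matrix multiplication
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# Matches NumPy's built-in estimator
print(np.allclose(cov, np.cov(X, rowvar=False)))  # True
```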
SVM
Support Vector Machine (SVM) is one of the most powerful out-of-the-box supervised machine learning algorithms. It is an application of the vector spaces of linear algebra.
In this algorithm, we plot each data item as a point in an n-dimensional space (where n is the number of features we have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best separates the two classes.
The vectors (cases) that define the hyperplane are the support vectors.
Algorithm
- Define an optimal hyperplane: maximize margin
- Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
- Map data to high dimensional space where it is easier to classify with linear decision surfaces: reformulate problem so that data is mapped implicitly to this space.
To define an optimal hyperplane we need to maximize the width of the margin, which works out to 2/‖w‖.
We find w and b by solving a Quadratic Programming problem, which in its standard form is: minimize ½‖w‖² subject to yi(w·xi + b) ≥ 1 for every training point (xi, yi).
The beauty of SVM is that if the data is linearly separable, there is a unique global minimum value. An ideal SVM analysis should produce a hyperplane that completely separates the vectors (cases) into two non-overlapping classes. However, perfect separation may not be possible, or it may result in a model with so many cases that the model does not classify correctly. In this situation SVM finds the hyperplane that maximizes the margin and minimizes the misclassifications.
The algorithm tries to keep the slack variables at zero while maximizing the margin. However, it does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes.
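As a rough illustration with scikit-learn on synthetic, well-separated blobs, the C parameter controls the trade-off between a wide margin and the slack, and the fitted model exposes the support vectors that define the hyperplane:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of 2D points
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.8)

# Linear kernel; C trades off margin width against the slack (misclassification penalty)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)   # w and b of the separating hyperplane
print(clf.support_vectors_.shape)  # the few cases that actually define the hyperplane
```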
The simplest way to separate two groups of data is with a straight line (1 dimension), flat plane (2 dimensions) or an N-dimensional hyperplane. However, there are situations where a nonlinear region can separate the groups more efficiently. SVM handles this by using a kernel function (nonlinear) to map the data into a different space where a hyperplane (linear) can be used to do the separation. It means a non-linear function is learned by a linear learning machine in a high-dimensional feature space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space. This is called the kernel trick: the kernel function lets us work in a higher-dimensional feature space, where linear separation becomes possible, without ever computing the transformed coordinates explicitly.
Instead of mapping the data into the new space and then taking inner products of the new vectors, we evaluate the kernel on the original data: the kernel value of two data points equals the inner product of their images in the new space.
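Here is a small toy illustration of that statement (not tied to any particular SVM library), using the polynomial kernel K(x, z) = (x·z)² and its explicit degree-2 feature map φ:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def kernel(x, z):
    """Polynomial kernel K(x, z) = (x . z)^2, computed without ever building phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Inner product of the images equals the kernel evaluated on the original data
print(np.dot(phi(x), phi(z)))  # 121.0
print(kernel(x, z))            # 121.0
```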
PCA
PCA is an algorithm used for dimensionality reduction. It finds the directions of maximum variance and projects the data along them to reduce the number of dimensions. These directions are the eigenvectors of the covariance matrix. An eigenvector (or characteristic vector) of a linear transformation is a nonzero vector that changes only by a scalar factor when that transformation is applied to it; its direction does not change.
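A minimal NumPy sketch of PCA done exactly this way, on random data, just to show the mechanics: build the covariance matrix, take its eigenvectors, and project the data onto the leading ones.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))          # 200 samples, 4 features
X_centered = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix are the principal directions
cov = X_centered.T @ X_centered / (X.shape[0] - 1)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric

# Sort by variance explained (largest eigenvalue first) and keep the top 2 directions
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]

# Project the data onto those directions: 4 dimensions reduced to 2
X_reduced = X_centered @ components
print(X_reduced.shape)  # (200, 2)
```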
Dealing with textual data
Machine learning algorithms cannot work with raw textual data. We need to convert the text into some numerical and statistical features to create model inputs. There are many ways for engineering features from text data, such as:
- Meta attributes of a text, like word count, special character count, etc. (a quick sketch follows this list)
- NLP attributes of text
- Word Vector Notations or Word Embeddings
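For the first of these, a quick pandas sketch (with made-up example sentences) of how such meta attributes can be engineered:

```python
import pandas as pd

docs = pd.Series([
    "Linear algebra is everywhere!",
    "SVMs, PCA & word embeddings all rely on vectors.",
])

# Simple meta attributes engineered from raw text
features = pd.DataFrame({
    "word_count": docs.str.split().str.len(),      # number of whitespace-separated tokens
    "char_count": docs.str.len(),                  # total number of characters
    "special_char_count": docs.str.count(r"[^\w\s]"),  # punctuation and other symbols
})
print(features)
```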
Word Embeddings are a way of representing words as low-dimensional vectors of numbers while preserving their context in the document. These representations are obtained by training neural networks on a large amount of text, called a corpus. They also help in analyzing syntactic similarity among words.
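As a toy illustration (the vectors below are invented for the example; real embeddings are learned from a corpus and have hundreds of dimensions), similarity between word vectors is just a normalized inner product:

```python
import numpy as np

# Toy, hand-made 3-dimensional "embeddings" for demonstration only
embeddings = {
    "king":  np.array([0.8, 0.65, 0.1]),
    "queen": np.array([0.75, 0.7, 0.15]),
    "apple": np.array([0.1, 0.05, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: inner product after normalization."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1: similar words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much smaller: unrelated words
```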
Dealing with images
An image looks meaningful to us, but how can a computer understand it? In a computer, just as the layers of a neural network are represented as vectors, an image is represented as numbers. A digital image is a two-dimensional discrete function f(x, y) with (x, y) ∈ ℤ², where each pair of coordinates is called a pixel. The word pixel, derived from "picture element", denotes the smallest logical unit of visual information used to construct an image. Without loss of generality, an image can be represented by a matrix in which each element (i, j) corresponds to the value of the pixel at image position (i, j).
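A tiny NumPy sketch of that idea, using a made-up 4x4 grayscale image: the image is literally a matrix, and matrix operations become image operations.

```python
import numpy as np

# A tiny 4x4 grayscale "image": each entry is the intensity of the pixel at position (i, j)
image = np.array([
    [  0,  50, 100, 150],
    [ 50, 100, 150, 200],
    [100, 150, 200, 250],
    [150, 200, 250, 255],
], dtype=np.uint8)

print(image.shape)  # (4, 4): a two-dimensional matrix
print(image[1, 2])  # 150: the pixel value at row i=1, column j=2

# Ordinary matrix operations become image operations
flipped = np.fliplr(image)                                            # mirror horizontally
brighter = np.clip(image.astype(int) + 40, 0, 255).astype(np.uint8)   # increase brightness
print(flipped[0], brighter[0])
```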
These are some of the applications of Linear Algebra in Data Science, and as I said, they are just a few; there are many more.
Thank You,
Author: Pragya Sinha