Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8156053
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 6, 20262026-06-06T16:50:20+00:00 2026-06-06T16:50:20+00:00

I’m working on machine learning problem and want to use linear regression as learning

  • 0

I’m working on machine learning problem and want to use linear regression as learning algorithm. I have implemented 2 different methods to find parameters theta of linear regression model: Gradient (steepest) descent and Normal equation. On the same data they should both give approximately equal theta vector. However they do not.

Both theta vectors are very similar on all elements but the first one. That is the one used to multiply vector of all 1 added to the data.

Here is how the thetas look like (fist column is output of Gradient descent, second output of Normal equation):

Grad desc Norm eq
-237.7752 -4.6736
-5.8471   -5.8467
9.9174    9.9178
2.1135    2.1134
-1.5001   -1.5003
-37.8558  -37.8505
-1.1024   -1.1116
-19.2969  -19.2956
66.6423   66.6447
297.3666  296.7604
-741.9281 -744.1541
296.4649  296.3494
146.0304  144.4158
-2.9978   -2.9976
-0.8190   -0.8189

What can cause the difference in theta(1, 1) returned by gradient descent compared to theta(1, 1) returned by normal equation? Do I have bug in my code?

Here is my implementation of normal equation in Matlab:

function theta = normalEque(X, y)
    [m, n] = size(X);
    X = [ones(m, 1), X];
    theta = pinv(X'*X)*X'*y;
end

Here is code for gradient descent:

function theta = gradientDesc(X, y)
    options = optimset('GradObj', 'on', 'MaxIter',  9999);
    [theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),...
                    zeros(size(X, 2), 1), options);
end

function [J, grad] = cost(theta, X, y)
    m = size(X, 1);
    X = [ones(m, 1), X];
    J = sum((X * theta - y) .^ 2) ./ (2*m);
    for i = 1:size(theta, 1)
        grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m;
    end
end

I pass exactly the same data X and y to both functions (I do not normalize X).

Edit 1:

Based on answers and comments I checked few my code and run some tests.

First I want to check if the problem can be cause by X beeing near singular as suggested by @user1489497’s answer. So I replaced pinv by inv – and when run it I really got warning Matrix is close to singular or badly scaled.. To be sure that that is not the problem I obtained much larger dataset and run tests with this new dataset. This time inv(X) did not display the warning and using pinv and inv gave same results. So I hope that X is not close to singular any more.

Then I changed normalEque code as suggested by woodchips so now it looks like:

function theta = normalEque(X, y)
    X = [ones(size(X, 1), 1), X];
    theta = pinv(X)*y;
end

However the problem is still there. New normalEque function on new data that are not close to singular gives different theta as gradientDesc.

To find out which algorithm is buggy I have run linear regression algorithm of data mining software Weka on the same data. Weka computed theta very similar to output of normalEque but different to the output of gradientDesc. So I guess that normalEque is correct and there is a bug in gradientDesc.

Here is comparison of thetas computed by Weka, normalEque and GradientDesc:

Weka(correct) normalEque    gradientDesc
779.8229      779.8163      302.7994
  1.6571        1.6571        1.7064
  1.8430        1.8431        2.3809
 -1.5945       -1.5945       -1.5964
  3.8190        3.8195        5.7486
 -4.8265       -4.8284      -11.1071
 -6.9000       -6.9006      -11.8924
-15.6956      -15.6958      -13.5411
 43.5561       43.5571       31.5036
-44.5380      -44.5386      -26.5137
  0.9935        0.9926        1.2153
 -3.1556       -3.1576       -1.8517
 -0.1927       -0.1919       -0.6583
  2.9207        2.9227        1.5632
  1.1713        1.1710        1.1622
  0.1091        0.1093        0.0084
  1.5768        1.5762        1.6318
 -1.3968       -1.3958       -2.1131
  0.6966        0.6963        0.5630
  0.1990        0.1990       -0.2521
  0.4624        0.4624        0.2921
-12.6013      -12.6014      -12.2014
 -0.1328       -0.1328       -0.1359

I also computed errors as suggested by Justin Peel’s answer. Output of normalEque gives slightly lesser squared error but the difference is marginal. What is more when I compute gradient of cost of theta using function cost (the same as the one used by gradientDesc) I got gradient near zero. Same done on output of gradientDesc does not give gradient near zero. Here is what I mean:

>> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1));
>> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1));
>> disp([J_gd, J_ne])
  120.9932  119.1469
>> disp([grad_gd, grad_ne])
  -0.005172856743846  -0.000000000908598
  -0.026126463200876  -0.000000135414602
  -0.008365136595272  -0.000000140327001
  -0.094516503056041  -0.000000169627717
  -0.028805977931093  -0.000000045136985
  -0.004761477661464  -0.000000005065103
  -0.007389474786628  -0.000000005010731
   0.065544198835505  -0.000000046847073
   0.044205371015018  -0.000000046169012
   0.089237705611538  -0.000000046081288
  -0.042549228192766  -0.000000051458654
   0.016339232547159  -0.000000037654965
  -0.043200042729041  -0.000000051748545
   0.013669010209370  -0.000000037399261
  -0.036586854750176  -0.000000027931617
  -0.004761447097231  -0.000000027168798
   0.017311225027280  -0.000000039099380
   0.005650124339593  -0.000000037005759
   0.016225097484138  -0.000000039060168
  -0.009176443862037  -0.000000012831350
   0.055653840638386  -0.000000020855391
  -0.002834810081935  -0.000000006540702
   0.002794661393905  -0.000000032878097

This would suggest that gradient descent simply did not converge to global minimum… But that is hardly the case as I run it for thousands of iterations. So where is the bug?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-06T16:50:21+00:00Added an answer on June 6, 2026 at 4:50 pm

    I finally had time to get back to this. There is no “bug”.

    If the matrix is singular, then there are infinitely many solutions. You can choose any solution from that set, and get equally as good an answer. The pinv(X)*y solution is a good one that many like because it is the minimum norm solution.

    There is NEVER a good reason to use inv(X)*y. Even worse, is to use inverse on the normal equations, thus inv(X’*X)*X’*y is simply numerical crap. I don’t care who told you to use it, they are guiding you to the wrong place. (Yes, it will work acceptably for problems that are well-conditioned, but most of the time you don’t know when it is about to give you crap. So why use it?)

    The normal equations are in general a bad thing to do, EVEN if you are solving a regularized problem. There are ways to do that that avoid squaring the condition number of the system, although I won’t explain them unless asked as this answer has gotten long enough.

    X\y will also yield a result that is reasonable.

    There is ABSOLUTELY no good reason to throw an unconstrained optimizer at the problem, as this will yield results that are unstable, completely dependent on your starting values.

    As an example, I’ll start with a singular problem.

    X = repmat([1 2],5,1);
    y = rand(5,1);
    
    >> X\y
    Warning: Rank deficient, rank = 1, tol =  2.220446e-15. 
    ans =
                             0
             0.258777984694222
    
    >> pinv(X)*y
    ans =
             0.103511193877689
             0.207022387755377
    

    pinv and backslash return slightly different solutions. As it turns out, there is a basic solution, to which we can add ANY amount of the nullspace vector for the row space of X.

    null(X)
    ans =
             0.894427190999916
            -0.447213595499958
    

    pinv generates the minimum norm solution. Of all of the solutions that might have resulted, this one has minimum 2-norm.

    In contrast, backslash generates a solution that will have one or more variables set to zero.

    But if you use an unconstrained optimizer, it will generate a solution that is completely dependent on your starting values. Again, ANLY amount of that null vector can be added to your solution, and you still have an entirely valid solution, with the same value of the sum of squares of errors.

    Note that even though no singularity waring is returned, this need not mean your matrix is not close to singular. You have changed little about the problem, so it is STILL close, just not enough to trigger the warning.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
I want use html5's new tag to play a wav file (currently only supported
I have a French site that I want to parse, but am running into
I have a .ini file as follows: [playlist] numberofentries=2 File1=http://87.230.82.17:80 Title1=(#1 - 365/1400) Example
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I have just tried to save a simple *.rtf file with some websites and
I want to count how many characters a certain string has in PHP, but
I am trying to understand how to use SyndicationItem to display feed which is
I have a jquery bug and I've been looking for hours now, I can't
this is what i have right now Drawing an RSS feed into the php,

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.