Wow, most of this goes way over my head. I kind of understand the method of limiting things to 8 bits. I’ve heard about that being done elsewhere.
When you say negative numbers, are you talking about the embeddings?
I still don’t understand gradient descent fully, but can you explain why you think it should be replaced and with what?
Thanks!
Ha ha no, I never went as far as needing embeddings for a language model. MNIST is actually, you know, a very simple classification problem. It’s a bit the ‘Hello World’ of machine learning. It’s a dataset of handwritten digits, and you have to classify each one as 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9.
It is a good test because you can train it in minutes if not seconds even on crappy hardware and unoptimized code.
So when I’m talking about negative numbers, I really am talking about negative numbers: weights that need to be negative in order to have a negative influence on the output. Like “this pixel in the center is white, so the likelihood of the number being zero decreases.”
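To make the negative-weight point concrete, here is a toy sketch. The 3x3 "image", the hand-picked weights and the `zero_score` helper are all made up for illustration; nothing here is trained, and real MNIST images are 28x28.

```python
# Toy "does this look like a zero?" score for a 3x3 image, flattened row by row.
# The weights are hand-picked for illustration, not learned.
def zero_score(pixels, weights):
    # Weighted sum of the pixels: the core of a single linear unit.
    return sum(p * w for p, w in zip(pixels, weights))

# Ring pixels vote for "zero"; the center pixel votes against it (negative weight).
weights = [1.0, 1.0, 1.0,
           1.0, -4.0, 1.0,
           1.0, 1.0, 1.0]

ring = [1, 1, 1,
        1, 0, 1,
        1, 1, 1]   # white ring, dark center: looks like a zero

blob = [1, 1, 1,
        1, 1, 1,
        1, 1, 1]   # same ring, but the center is white too

print(zero_score(ring, weights))  # 8.0
print(zero_score(blob, weights))  # 4.0
```

The only difference between the two inputs is the white center pixel, and the negative weight is what lets it pull the "zero" score down.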
Honestly, I’m just talking about it because we are being silly, but I am not sure that’s an idea I actually want to defend.
I just have this feeling that gradient descent is a good mathematical construction for what we try to achieve, but that mathematical purity maybe, just maybe, gets in the way of efficient computing. Of course, there are thousands of very competent, highly paid people who have already explored that avenue, so I’m pretty sure that if something better were possible and within the reach of one person, it would already have been discovered.
(Counterpoint: we routinely rediscover things that were invented in the 90s and that have become good ideas now that we have very good computing.)
The thing is, gradient descent uses the gradient to tell you in which direction you’re supposed to move a weight to lower the loss of your results. In other words, to minimize the error of your network.
Gradients, or partial derivatives, are like an ideal mathematical tool to do that. We are able to compute them for a lot of functions, linear or not, and they are well-studied mathematical objects, so it really makes sense to use them.
The gradient tells you the direction in which the parameters need to move: you step against it. More precisely, the sign of the partial derivative of the loss with respect to a given parameter tells you whether you need to increase or lower that parameter for the loss to improve. If it is positive you lower the parameter, and if it is negative you increase it.
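Here is a minimal sketch of that on a single parameter. The loss function (a parabola with its minimum at 3), the learning rate and the helper names are assumptions picked for illustration; a real network has many parameters and a much messier loss.

```python
# One-parameter toy problem: the loss is lowest at w = 3.
def loss(w):
    return (w - 3.0) ** 2

# Derivative of the loss with respect to w, worked out by hand.
def grad(w):
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=50):
    for _ in range(steps):
        # Step against the gradient: a positive derivative lowers w,
        # a negative derivative raises it.
        w = w - lr * grad(w)
    return w

print(grad(5.0))              # positive: w is too high, so it must go down
print(grad(1.0))              # negative: w is too low, so it must go up
print(gradient_descent(5.0))  # ends up very close to 3
```

Note that the size of each step depends on the size of the derivative, not just on its sign. That is the part questioned in the rest of this reply.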
Thing is, we use the sign, that’s clear. But the intensity, I am not sure it is that relevant, because we keep fighting against things like the vanishing gradient problem, where very deep networks tend to have very small gradients, and we compensate for a lot of those problems through optimizers, choices and tricks.
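A minimal sketch of what "keep the sign, ignore the intensity" could look like on a toy one-parameter problem. The `sign_descent` helper, the step size and the shrink factor are assumptions made up for illustration. Sign-based updates do exist in the literature under names like Rprop and signSGD, but this toy is not an implementation of either.

```python
def sign(x):
    # Returns -1, 0 or +1.
    return (x > 0) - (x < 0)

# Derivative of the toy loss (w - 3)^2, which is lowest at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

def sign_descent(w, step=0.5, iters=40, shrink=0.9):
    for _ in range(iters):
        # Only the sign of the derivative is used; its magnitude is thrown away.
        w = w - step * sign(grad(w))
        # The step no longer shrinks on its own near the minimum,
        # so it is decayed by hand to let w settle.
        step *= shrink
    return w

print(sign_descent(5.0))  # ends up close to 3
```

The catch shows up right away: once the magnitude is gone, something else has to decide how big each step is, which is exactly the job the optimizers and tricks end up doing.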
I wonder if there would not be a pure computer science way of just keeping track of the direction in which you want the parameter to change.
I don’t know, maybe triple all the calculations by evaluating one tick in both directions? Or just use one-bit gradients when it makes sense? Or find a function that’s very fast to compute, that only approximates the gradient, and that is just better than randomness at finding the sign.
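The "one tick in both directions" idea can be sketched without any derivative at all, under one reading of it: evaluate the loss at the current value and one tick either side, then move toward whichever is lowest. The toy loss, the tick size and the helper names `probe_direction` and `tick_descent` are all assumptions for illustration.

```python
# Toy loss with its minimum at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def probe_direction(loss_fn, w, tick):
    """Return -1, 0 or +1: which of w - tick, w, w + tick has the lowest loss."""
    candidates = {
        -1: loss_fn(w - tick),
        0: loss_fn(w),
        +1: loss_fn(w + tick),
    }
    return min(candidates, key=candidates.get)

def tick_descent(loss_fn, w, tick=0.25, iters=40):
    for _ in range(iters):
        # Three loss evaluations per step, no derivative needed.
        w += tick * probe_direction(loss_fn, w, tick)
    return w

print(probe_direction(loss, 5.0, 0.25))  # -1: lowering w helps
print(tick_descent(loss, 5.0))           # walks down to the minimum at 3
```

This is where the cost shows up. It takes three loss evaluations per parameter per step, whereas backpropagation produces every partial derivative in a single backward pass, so on a network with many weights the naive version would be far slower.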
Like I said, that’s just an itch to scratch. That’s not a strong conviction that there is something there. But if you were to give me two weeks’ salary to just work on that, I would be very happy to.
What about quantum computing? That seems to be the next step. Much more efficient. Again, I’m totally out of my depth there as well. I only understand qubits at a very basic level. I don’t even really understand wave functions.