Mental-Swordfish7129

Mental-Swordfish7129 t1_j3q81e2 wrote

Reply to comment by jimmymvp in [N] What's next for AI? by vsmolyakov

Active just means that it directly modifies its input stream. And, yes, it is also predicting what that input will be, so it is reasonable to say that it is, in part, self-predictive.

Crucially, its input stream also includes features that are not itself or have not been changed by itself. The proprioceptive signals help it learn which is which.

1

Mental-Swordfish7129 t1_j3q7lbz wrote

Reply to comment by jimmymvp in [N] What's next for AI? by vsmolyakov

I don't think this model is within the realm of ML (it's theoretical neuroscience, although there is much overlap), but it does qualify as AI, which is what the post title asked about.

There is an annual symposium, the International Workshop on Active Inference, which has run for about 3 years now; research is presented there and the papers are linked on its site.

And, of course, there are the dozens of research papers on the topic you can find through Google Scholar.

Edit: I did find where a few active inference papers have been presented at NeurIPS.

1

Mental-Swordfish7129 t1_j3q6p7m wrote

Reply to comment by jimmymvp in [N] What's next for AI? by vsmolyakov

The model is generative. Each layer generates predictions about the patterns of the layers below. The bottom layer generates predictions about the sensory data, some of which is proprioception data.

I have never published anything. I do not have that much time, and it would largely be redundant. You can look at Friston et al. for the math; I use nearly the same math and logic.

What I'm doing bears only a superficial similarity to Gato in my opinion, but I can't say I've looked into it deeply. I've been far too busy with life. I only have my tiny spare time for this project unfortunately.

1

Mental-Swordfish7129 t1_j3q1hej wrote

Reply to comment by _xenoschema in [N] What's next for AI? by vsmolyakov

It's a model that "chooses" its input stream from a 2D array of sensor data (camera, mics, and servo encoders) in real time, using policies decoded from the bottom layer's predictions. It then processes this input up a hierarchy of identical layers. Predictions from higher layers are used to modulate attention.

It may qualify as a general intelligence (idk) as any data can be encoded into the format of its input stream. What I mean is that I have a particular way of encoding video, audio, anything really, into a universal format which preserves the salient semantics.

Currently, it is greatly inhibited in what it can learn because I cannot feed it experiences at the rate it could take them. It has far more potential than realized knowledge.

1

Mental-Swordfish7129 t1_j3leaf6 wrote

Reply to comment by jimmymvp in [N] What's next for AI? by vsmolyakov

Here's a fairly accessible free e-book by the principal researcher on the topic, Karl Friston...

https://mitpress.mit.edu/9780262045353/active-inference/

He's got tons of papers. He's one of the most cited scientists alive.

Also, there are lectures and such on YouTube. Just search the terms "free energy principle", "active inference", and "predictive processing".

Some other good books are "Surfing Uncertainty" by Andy Clark and "The Predictive Mind" by Jakob Hohwy.

5

Mental-Swordfish7129 t1_j3l6yqj wrote

I think the field may eventually move in the direction of online unsupervised generative models implementing something akin to the free energy principle and active inference. These are the kinds of models I am developing and they seem to circumvent many of the issues with modern SOTA ML. I figure it will be a while before this happens because it seems that there isn't a lot of interest yet from the typical sources.

2

Mental-Swordfish7129 t1_j3bbbpn wrote

>I've tried it too, I admit. You go from "I think it's doable" to "hell no, this isn't ever gonna work" in a couple of hours, lol.

I've been at it for around 12 years in my little free time, and I've made fairly steady progress, excluding a few setbacks. I think I must have gotten very lucky many times. When I look at my approach back then, I can see I was wayyy off: very ignorant and ridiculous.

2

Mental-Swordfish7129 t1_j2y3l18 wrote

>I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space.

Continuous spaces are simply not necessary for what I'm doing. I avoid infinite precision because there is little need for precision beyond a certain threshold.

Also, I'm just a regular guy. I do this in my limited spare time, and I only have relatively weak computational resources and hardware. I'm trying to be efficient anyway, like the brain, and it helps that there isn't a floating-point operation in sight.

Discrete space works just fine, and there is no possible ambiguity about what a particular index of the space represents. In a continuous space, you'd have to worry that something has been truncated or rounded away.

Idk. Maybe my reasons are ridiculous.

2

Mental-Swordfish7129 t1_j2xrr7a wrote

>How do you achieve something similar in your binary latent space?

All data coming in is encoded into these high-dimensional binary vectors where each index in a vector corresponds to a relevant feature in the real world. Then, computing error is as simple as XOR(actual incoming data, prediction). This preserves the semantic details of how the prediction was wrong.
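As a minimal sketch of that error computation (the vector width and variable names are my own illustration, not from the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 2000-bit binary vectors: what actually arrived vs. what was predicted.
actual = rng.integers(0, 2, size=2000, dtype=np.uint8)
prediction = rng.integers(0, 2, size=2000, dtype=np.uint8)

# XOR sets a 1 at exactly the indices where the prediction was wrong,
# so the error vector itself says *which* features were mispredicted.
error = np.bitwise_xor(actual, prediction)
error_magnitude = int(error.sum())  # Hamming distance, if a scalar is needed
```

Because each index corresponds to a named real-world feature, the error vector is interpretable bit by bit, not just as a scalar loss.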

There is no fancy activation function, just a simple sum over the connected synapses that terminate on an active element.

Synapses are binary. Connected or not. They decay over time and their permanence is increased if they're useful often enough.
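A toy sketch of how such binary synapses might work (the unit count, threshold, and learning rates here are all my guesses, not the actual parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000          # input width (matching the ~2000-bit codes)
UNITS = 64        # hypothetical number of units in one layer
THRESHOLD = 0.5   # permanence at or above this counts as "connected"

# Each unit keeps a permanence value per input bit. The synapse itself
# is binary: it either exists (permanence >= THRESHOLD) or it doesn't.
permanence = rng.random((UNITS, N))

def activate(input_bits):
    # No fancy activation function: each unit simply sums its connected
    # synapses that land on an active input element.
    connected = (permanence >= THRESHOLD).astype(np.int32)
    return connected @ input_bits.astype(np.int32)

def learn(input_bits, active_units, decay=0.01, boost=0.05):
    # All permanences decay over time...
    np.clip(permanence - decay, 0.0, 1.0, out=permanence)
    # ...but synapses that were useful (an active unit meeting an active
    # input bit) have their permanence increased.
    useful = np.outer(active_units, input_bits).astype(bool)
    permanence[useful] = np.clip(permanence[useful] + decay + boost, 0.0, 1.0)
```

The permanence value is the only continuous quantity, and it only gates whether a synapse exists; the forward pass itself is pure integer arithmetic.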

3

Mental-Swordfish7129 t1_j2xqlwa wrote

The really interesting thing as of late is that if I "show" the agent its global error metric as part of its input while forcing it (by moving the reticle directly) out of boredom toward higher information gain, I can eventually stop the forcing, because it learns to force itself out of boredom. It seems to learn the association between a rapidly declining error and a shift to a more interesting input. I just have to facilitate the bootstrapping.

It eventually exhibits more and more sophisticated behavioral sequences (longer cycles before repeating), and the same happens at higher levels with the attentional changes.

All layers perform the same function. They only differ because of the very different "world" to which they are exposed.

3

Mental-Swordfish7129 t1_j2x3juw wrote

Idk if it's in the literature. At this point, I can't tell what I've read from what has occurred to me.

I keep track of the error each layer generates and also a brief history of its descending predictions. Then, I simply reinforce the generation of predictions which produce the highest rate of reduction in subsequent error. I think this amounts to a modulation of attention (manifested as a pattern of bit masking applied to the ascending error signal) which effectively ignores the portions of the signal that have low information and high variance.

At the bottom layer, this is implemented as choosing behaviors (moving a reticle over an image: up, down, left, right) which accomplish the same avoidance of high variance, and thus high noise, while seeking high information gain.
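A rough sketch of that reinforcement signal (the function names and the simple linear rate estimate are my own simplifications of the idea):

```python
import numpy as np

def error_reduction_rate(error_history):
    """Average decline per step of a scalar error metric over a brief history."""
    e = np.asarray(error_history, dtype=float)
    return (e[0] - e[-1]) / max(len(e) - 1, 1)

def pick_reinforced(recent_predictions, error_histories):
    """Reinforce whichever recent descending prediction preceded the
    fastest subsequent drop in error."""
    rates = [error_reduction_rate(h) for h in error_histories]
    return recent_predictions[int(np.argmax(rates))]
```

The same scoring works whether the "prediction" being reinforced is an attention mask at a higher layer or a reticle movement at the bottom.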

The end result is a reticle which behaves like a curious agent attempting to track new, interesting things and study them a moment before getting bored.

The highest layers seem to be forming composite abstractions on what is happening below, but I have yet to try to understand.

I'm fine with questions.

3

Mental-Swordfish7129 t1_j2v20d2 wrote

That's amazing. We probably haven't fully realized the great powers of analysis we have available using Fourier transform and wavelet transform and other similar strategies.

9

Mental-Swordfish7129 t1_j2twm92 wrote

This is the big deal. Interpretability is so important and I think it will only become more desirable to understand the details of these models we're building. This has been an important design criterion for me as well. I feel like I have a deep intuitive understanding of the models I've built recently and it has helped me improve them rapidly.

20

Mental-Swordfish7129 t1_j2tubij wrote

It's pretty vanilla.

Message passing up is prediction error.

Down is prediction used as follows:

I use the bottom prediction to characterize external behavior.

Prediction at higher levels characterizes attentional masking and other alterations to the ascending error signals.
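As a toy illustration of the up-pass only (the predictions here are random stand-ins; the real model generates them, and the descending/attention side is not shown):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 256  # toy vector width

# Stand-in descending predictions for a two-layer stack.
predictions = [rng.integers(0, 2, size=N, dtype=np.uint8) for _ in range(2)]

def up_pass(sensory):
    """Each layer passes upward the XOR error between its prediction
    and the signal ascending from below."""
    signal = sensory
    errors = []
    for pred in predictions:
        signal = np.bitwise_xor(signal, pred)
        errors.append(signal)
    return errors
```

When a layer predicts its input perfectly, it passes an all-zero vector upward, so only the surprising bits propagate up the stack.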

3

Mental-Swordfish7129 t1_j2ta6bw wrote

I know, right? It happens over and over. Someone's great idea gets overlooked or forgotten, and then later some people declare the idea "new" and the fanfare ensues. If you're not paying close attention, you won't notice that often the true innovation is very subtle. I'm not trying to put anyone down; it's common for innovation to be subtle and to rest on many other people's work. My model rests on a lot of brilliant people's work going all the way back to the early 1900s.

20

Mental-Swordfish7129 t1_j2t22qq wrote

The biggest reason I use this encoding is the latent space it creates. My AI models are of the sparse distributed memory (SDM) variety, with a predictive processing architecture computing something very similar to active inference. This encoding allows for complete universality, and the latent space provides for the generation of semantically relevant memory abstractions.

8

Mental-Swordfish7129 t1_j2t17wy wrote

Idk much about other encoding systems. This works well for my purposes, and it's scalable. I look at my data and ask, "how many binary features of each datum are salient, and which features are important to the model for judging similarities?" 2000 may be too many sometimes. Also, remember that a binary vector is often handled as an integer array holding the indices of the bits set to 1; if your vectors are sparse, this can be very efficient. For the AI models I build, my vectors are often quite sparse because I often use a scheme like a "slider" of activations for integer data: sort of like one-hot, but with three or more consecutive bits set to encode associativity.
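A hedged sketch of that "slider" idea (the parameter names, small toy sizes, and position formula are my own illustration, not the original scheme):

```python
import numpy as np

def slider_encode(value, max_value, n_bits=100, width=5):
    """Encode an integer as a sparse binary vector: a "slider" of
    `width` consecutive set bits whose position tracks the value.
    Nearby values share set bits, so overlap encodes similarity."""
    vec = np.zeros(n_bits, dtype=np.uint8)
    start = int(round(value / max_value * (n_bits - width)))
    vec[start:start + width] = 1
    return vec

def to_indices(vec):
    """Sparse vectors are cheap to store as just the indices of their 1-bits."""
    return np.flatnonzero(vec)
```

Because consecutive values overlap in their set bits, similarity falls out of plain bitwise AND, with no distance function to tune.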

10

Mental-Swordfish7129 t1_j2s6xlg wrote

Interesting. I've had success encoding the details of words (anything, really) using high-dimensional binary vectors. I use about 2000 bits for each code. It's usually plenty as it is often difficult to find 2000 relevant binary features of a word. This is very efficient for my model and allows for similarity metrics and instantiates a truly enormous latent space.

52

Mental-Swordfish7129 t1_j2llbwf wrote

What if you encode the data with high-dimensional binary vectors and utilize a sparse distributed memory? I've used this approach many times with models I've built and you can measure semantic (Hamming) distance between data and you have a latent space for what similar data would have to look like. It's similar to a self-organizing map approach.
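A minimal sketch of that distance-based lookup (the memory here is just random codes; a real sparse distributed memory layers addressing and superposed storage on top of this):

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny "memory" of 2000-bit binary codes.
memory = rng.integers(0, 2, size=(50, 2000), dtype=np.uint8)

def hamming(a, b):
    """Semantic distance between two binary codes: count of differing bits."""
    return int(np.bitwise_xor(a, b).sum())

def nearest(query):
    """Index of the stored code closest to the query in Hamming distance."""
    distances = np.bitwise_xor(memory, query).sum(axis=1)
    return int(np.argmin(distances))
```

A noisy or partial query still lands on the right entry as long as it is closer to that entry than to any other, which is what makes the latent space useful for judging what similar data would have to look like.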

1

Mental-Swordfish7129 t1_j28g001 wrote

>There is no inherent reason why human behavior cannot be modelled algorithmically using computers.

I think we can make an even stronger claim... If we examine a "behavior" we see that it is only a behavior because the relevant axons happen to terminate at an end effector like muscle tissue. If these same axons were transposed to instead terminate at other dendrites, we might label their causal influence an attentional change or a "shifting thought". So, by extending your argument, there is no good reason to suspect we cannot model ANY neural process whatsoever. This is how causal influence proceeds in the model I have created. It's a stunning thing to observe.

2

Mental-Swordfish7129 t1_j28a9fj wrote

I have an AI model I've been working on for some time now which I believe may be much more interpretable than many recent developments. It utilizes Bayesian model evidence as a core function anyway, so it has already "prepared" evidence and an explanation of sorts for why it "believes" what it "believes". This has made for an interesting development process as I can observe its reasoning evolve. I could elaborate if you're interested.

3