1. Introduction

1.1 Agents and Environments

An agent-based system comprises two main functions: the agent function A(s) and the environment function E(a), where s is an observable state of the environment and a is an action selected by the agent. At any given time, the input received by each function is the output of the other function at the previous time step.
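
This coupling can be sketched as a simple loop. The following is a minimal sketch in which A and E (here hypothetical placeholders, not definitions from the text) alternately consume each other's previous output:

```python
def A(s):
    # Hypothetical agent function: maps an observed state to an action.
    return "move" if s == "obstacle" else "wait"

def E(a):
    # Hypothetical environment function: maps an action to the next state.
    return "clear" if a == "move" else "obstacle"

s = "obstacle"            # initial observable state
for t in range(4):        # each step feeds one function's output to the other
    a = A(s)              # agent acts on the previous environment output
    s = E(a)              # environment reacts to the agent's action
    print(f"t={t}: action={a}, state={s}")
```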

Inherent to agent-based systems is the assumption that the actions taken by the agent are selected in response to the state observed at any given time, and that subsequent observations are in some way related to the actions carried out prior to the current time step. In other words, causality is assumed between environmental states and the agent’s behavior, both in the sense that the agent selects actions in accordance with the environment, and in the sense that the actions it carries out have some effect on, that is to say trigger some change in, the environment.

1.2 Cause and Effect

This symbiotic relationship between agent and environment makes a formal representation of causality within the system necessary. Agents must be able to model cause and effect, specifically how their behavior, applied to an “initial state” of the environment, gives rise to different “final states”. This type of information is represented by “if-then” data structures: ordered pairs of observations in which the occurrence of the first observation is assumed to bear some responsibility for the occurrence of the second.

When used in the context of an environment function, if-thens may be referred to as “cause-effect” structures, because events in the environment tend to be more fixed than events in the agent: the mapping between inputs and outputs of the environment is relatively constant compared to the agent function, where the response to a given input may depend on additional information such as context or current goals. In either case, the system uses these data structures to model sequences of events and observations through time.
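
A minimal sketch of such an if-then store, representing ordered pairs of observations with a co-occurrence count (the counting scheme is an assumption, not prescribed by the text):

```python
from collections import defaultdict

class IfThenStore:
    """Ordered pairs (antecedent, consequent) with co-occurrence counts."""
    def __init__(self):
        self.pairs = defaultdict(int)

    def observe(self, antecedent, consequent):
        # Record that `antecedent` was followed by `consequent`.
        self.pairs[(antecedent, consequent)] += 1

    def predict(self, antecedent):
        # Return the consequent most often seen after `antecedent`.
        matches = {c: n for (a, c), n in self.pairs.items() if a == antecedent}
        return max(matches, key=matches.get) if matches else None

store = IfThenStore()
store.observe("switch_pressed", "light_on")
store.observe("switch_pressed", "light_on")
print(store.predict("switch_pressed"))  # -> light_on
```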

1.3 Change and Comparison

Agents are automatically driven to detect differences across space and time, and thus perceive their environment through measurements of change. A measurement is simply a comparison between two things that yields their difference. An agent can also compare two measurements, enabling comparison of information at any scale and without limit. It can be said that an agent’s “attention” is naturally drawn to change, and that its “knowledge” consists of these measurements.
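
As an illustrative sketch (the numeric encoding is an assumption), a measurement can be modeled as a difference between two values, and measurements can themselves be compared, yielding differences of differences at any scale:

```python
def measure(x, y):
    # A measurement: the difference between two comparable things.
    return y - x

# First-order measurements: change in a sensor reading over time.
m1 = measure(10.0, 12.5)   # change between t=0 and t=1
m2 = measure(12.5, 13.0)   # change between t=1 and t=2

# Second-order measurement: comparing the two measurements themselves,
# i.e. how much the rate of change itself changed.
m3 = measure(m1, m2)
print(m1, m2, m3)  # 2.5 0.5 -2.0
```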

1.4 Goals and Performance

Agents adapt their behavior in response to utility ratings, which gauge how well a previous set of actions carried out by the agent assisted in achieving some set of goals. Utility is a measure of performance, originating either in the external environment or within the agent itself. External utility ratings can always be broken down and described in terms of internal utility: the internal utility ratings of an agent are directly related to a set of goals, while external utility ratings are indirectly related to those goals. For example, a paycheck is an external utility rating that indirectly relates to internal goals like hunger, shelter, and leisure. External utility can be thought of as a means to an end, namely the internal goals of the agent.

When an agent receives a negative utility rating, the action or set of actions carried out in the previous time steps is incrementally decoupled from the observations that triggered it, effectively decreasing the probability that the agent will respond in the same way to a future instance of the same observations. In other words, the agent’s response to those observations in the future is less likely to match its past response, because that response was previously unsuccessful and thus bad for the agent with respect to its goals. By adapting its behavior to maximize utility, an agent “learns” how to behave in order to achieve a set of objectives.
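
One way to sketch this decoupling (the update rule and step size are assumptions) is a table of observation-action association strengths that is incrementally weakened on negative utility:

```python
associations = {("obstacle", "push"): 0.9}  # observation -> action strength

def apply_utility(observation, action, utility, step=0.1):
    # Negative utility incrementally decouples the action from the
    # observation that triggered it; positive utility strengthens the link.
    key = (observation, action)
    strength = associations.get(key, 0.0) + step * utility
    associations[key] = min(max(strength, 0.0), 1.0)  # clamp to [0, 1]

apply_utility("obstacle", "push", utility=-1.0)
print(associations)  # ('obstacle', 'push') weakened to ~0.8: less likely to repeat
```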

2. Learning

2.1 States and Actions

Inputs from the environment are received as input vectors: subsets of the input space. To reduce the dimensionality of observations, pattern recognition is applied to convert the input vectors into feature vectors, and then into states. A state is simply a combination of features detected after a given input vector has been converted into a feature vector. The result is a state space with lower dimensionality than the input space. The same process occurs at the other end, where output vectors are converted into feature vectors in order to obtain a lower-dimensional action space. In each case, patterns are learned with a frequency-based unsupervised algorithm for the purpose of dimensionality reduction.
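
A toy sketch of this pipeline (the binary encoding and frequency threshold are assumptions): frequently co-active input components are learned as features, and each input vector is then re-expressed as the combination of features it contains, yielding a lower-dimensional state:

```python
from collections import Counter
from itertools import combinations

inputs = [
    (1, 0, 1, 1, 0),   # raw input vectors (subsets of the input space)
    (1, 0, 1, 0, 0),
    (1, 0, 1, 1, 0),
]

# Frequency-based, unsupervised: count co-active index pairs and keep
# those that occur often enough to serve as features.
counts = Counter()
for v in inputs:
    active = [i for i, bit in enumerate(v) if bit]
    counts.update(combinations(active, 2))
features = [pair for pair, n in counts.items() if n >= 2]

def to_state(v):
    # A state is the combination of features present in the feature vector.
    active = {i for i, bit in enumerate(v) if bit}
    return tuple(f for f in features if set(f) <= active)

print(features)             # learned feature patterns, e.g. [(0, 2), ...]
print(to_state(inputs[0]))  # lower-dimensional state for the first input
```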

2.2 Utility and Expectation

Relationships between states and actions are established in multiple ways. First, a given state is linked to certain actions, which are triggered when instances of that state are observed; this yields a predictable response by the agent whenever the state is observed. Second, a given action is linked to certain events that occur when that action is carried out; this yields a predictable state transition whenever the action is performed.

Connections between states and actions that dictate the triggered response by an agent are learned through utility maximization, while connections that dictate the predicted state transitions are learned through expectation maximization. Both learning methods use a reinforcement algorithm to optimize their respective function within the system.
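
A compact sketch of the two mappings (the tabular update rules and learning rate are assumptions, standing in for whatever reinforcement algorithm the system uses):

```python
policy = {}       # (state, action) -> value, learned by utility maximization
transition = {}   # (state, action) -> next-state counts, by expectation

def update_policy(state, action, utility, lr=0.1):
    # Reinforce the state->action link in proportion to observed utility.
    key = (state, action)
    policy[key] = policy.get(key, 0.0) + lr * (utility - policy.get(key, 0.0))

def update_transition(state, action, next_state):
    # Track which next state most often follows (state, action).
    counts = transition.setdefault((state, action), {})
    counts[next_state] = counts.get(next_state, 0) + 1

def expected_next(state, action):
    counts = transition.get((state, action), {})
    return max(counts, key=counts.get) if counts else None

update_policy("hungry", "eat", utility=1.0)
update_transition("hungry", "eat", "satiated")
print(expected_next("hungry", "eat"))  # -> satiated
```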

3. Decision Making

3.1 Choice and Predictions

An action taken in a particular state has an associated utility based on the expected state transition caused by performing that action. An agent can therefore take the set of potential states reachable via actions from the current state, and use the expected utility rating associated with each state to rank its options at any given time step. The fundamental logic is that any decision reduces to choosing the best known action with regard to its associated utility.
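
As a sketch (the tables here are hypothetical), this ranking reduces to looking up the expected transition for each available action and sorting by the utility of the state it leads to:

```python
# Hypothetical learned knowledge for the current state.
transitions = {"wait": "hungry", "eat": "satiated", "run": "tired"}
state_utility = {"hungry": -1.0, "satiated": 1.0, "tired": -0.3}

def rank_actions(actions):
    # Rank actions by the expected utility of their resulting states.
    return sorted(actions, key=lambda a: state_utility[transitions[a]],
                  reverse=True)

print(rank_actions(["wait", "eat", "run"]))  # ['eat', 'run', 'wait']
```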

Agents use their knowledge of actions and the associated state transitions to make predictions about future states given some initial state. The prediction function P(s, a) takes an initial state and an action, and returns the expected state that results from taking that action in that state. The expected state can then be fed back into P as a new initial state, creating a prediction chain that spans multiple time steps into the future. In this sense predictions are recursive and theoretically limitless, since any output of P can be passed back to P; however, the accuracy of each prediction decreases as the chain grows, since each new prediction lies further in the future and is predicated on earlier predictions that have yet to be verified.
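
A sketch of such a chain (P here is a hypothetical lookup table), where each predicted state is fed back into P, with a confidence value that decays at each step to reflect the compounding uncertainty:

```python
P = {("home", "walk"): "street", ("street", "walk"): "park",
     ("park", "rest"): "home"}

def prediction_chain(state, plan, decay=0.8):
    # Recursively apply P; confidence decays as the chain grows because
    # each step is predicated on unverified earlier predictions.
    chain, confidence = [], 1.0
    for action in plan:
        state = P.get((state, action))
        if state is None:
            break                      # no knowledge: chain ends here
        confidence *= decay
        chain.append((state, confidence))
    return chain

print(prediction_chain("home", ["walk", "walk", "rest"]))
# [('street', 0.8), ('park', ~0.64), ('home', ~0.51)]
```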

3.2 Selection and Reasoning

An agent’s ability to make predictions about future states enables it to make predictions about future utilities, and therefore to rank the possible actions with respect to the utility associated with each action’s resulting state transition. This process enables a form of reasoning that assists in selecting the best possible option at any given time. Reasoning offers a more rigorous (and therefore costly) method of action selection, as an alternative to the deterministic (and faster) decision-making process using if-then knowledge, which provides no information about the possible outcomes of actions but instead triggers actions in response to certain states, as shaped by previous utility ratings.

While this reactive decision-making process is adequate for most simple cases, and has considerable advantages over reasoning in terms of execution time, it fails dramatically when effective decision-making requires knowledge beyond the immediate state observations, or when a problem lacks a clear-cut answer. Since both the deterministic and the reason-based methods of action selection have pros and cons, they can be used jointly: the default method is deterministic, and the agent switches to reasoning when faced with incomplete knowledge or consistently poor utility ratings. Using both methods, switching between them when necessary, outperforms either method alone. The strengths of each tend to make up for the shortcomings of the other, yielding greater overall performance of the action selection strategy.
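
A sketch of the joint strategy (the trigger conditions, threshold, and helper functions are assumptions): respond deterministically by default, and fall back to prediction-based reasoning when the reactive knowledge is missing or has been performing poorly:

```python
def select_action(state, reactive_table, recent_utilities,
                  reason, poor_threshold=-0.5):
    """Deterministic by default; switch to reasoning when knowledge is weak.

    reactive_table: state -> action (fast if-then knowledge)
    recent_utilities: list of recent utility ratings
    reason: fallback function doing prediction-based ranking (costly)
    """
    known = state in reactive_table
    avg = (sum(recent_utilities) / len(recent_utilities)
           if recent_utilities else 0.0)
    if known and avg > poor_threshold:
        return reactive_table[state]   # fast, deterministic response
    return reason(state)               # slower, more rigorous reasoning

action = select_action("maze_fork", {}, [-1.0, -0.8],
                       reason=lambda s: "explore_left")
print(action)  # -> explore_left (no reactive entry, so reasoning is used)
```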

4. Goal Orientation

4.1 Discounts and Conditions

Utilities are discounted under three main conditions, giving three main types of discount: repetitive, predictive, and competitive. A repetitive discount occurs when a state yields a near-constant utility over a sequence of time steps. The utility rapidly decays until it has little influence over the agent, which then switches tasks, allowing the state to “restore” itself and reset its discount. Repetitive discounts promote task-switching and novelty, driving the agent toward unknown experiences in which new knowledge is obtained.

A predictive discount is applied to the expected utility of a possible future state, relative to the number of time steps until that state is predicted to occur. This gives immediacy to current situations, whereas responses to more distant future events are treated as less urgent. Finally, a competitive discount is applied when multiple utilities are “competing”, i.e. simultaneously present and in conflict with one another. The more important utility is selected and discounts the others, effectively maximizing its perceived importance to the agent.
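
The three discounts can be sketched as follows (the decay rates and suppression factor are assumptions):

```python
def repetitive_discount(utility, repeats, decay=0.5):
    # Utility decays rapidly the longer a state keeps yielding it.
    return utility * (decay ** repeats)

def predictive_discount(utility, steps_ahead, gamma=0.9):
    # Expected utility shrinks with distance into the future.
    return utility * (gamma ** steps_ahead)

def competitive_discount(utilities, suppression=0.5):
    # The most important utility discounts all competing utilities.
    top = max(range(len(utilities)), key=lambda i: abs(utilities[i]))
    return [u if i == top else u * suppression
            for i, u in enumerate(utilities)]

print(repetitive_discount(1.0, repeats=3))      # 0.125
print(predictive_discount(1.0, steps_ahead=5))  # ~0.59
print(competitive_discount([0.9, 0.4, -0.2]))   # [0.9, 0.2, -0.1]
```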

4.2 Drive and Equilibrium

Agents have internal variables that must remain at equilibrium. Each variable has a desired value, and whenever it departs from that value it creates a drive. A drive prompts the agent to act in a way that returns the variable to its desired value, thus reducing the drive. When an action reduces a drive, a reward is given that increases the probability of the agent responding in the same way to future instances of that drive. Over time, different drives become associated with the actions that yield the highest reward, so the optimal response to a given drive becomes more likely as the agent proceeds by trial and error.
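
A minimal sketch of a drive (the variable, set point, and reward rule are assumptions):

```python
class Drive:
    def __init__(self, name, desired, value):
        self.name, self.desired, self.value = name, desired, value

    def magnitude(self):
        # Drive strength grows with distance from the desired value.
        return abs(self.desired - self.value)

hunger = Drive("hunger", desired=0.0, value=0.75)
before = hunger.magnitude()
hunger.value = 0.25                # the agent eats; the variable moves back
reward = before - hunger.magnitude()
print(reward)  # 0.5 -- positive reward: the action reduced the drive
```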

Each time a drive occurs, the observations leading up to it are connected to it, such that future instances of those observations trigger a prediction of that drive. Over time, correct predictions are rewarded and the most accurate event-to-drive connections are found. Once a given drive is accurately predicted and successfully responded to, the associated event is connected to the associated action, so that the prediction of a drive triggers the correct response before the drive even occurs.
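
A sketch of this anticipatory chaining (the tables are hypothetical): once an event reliably predicts a drive, and that drive has a learned response, the event is wired directly to the response:

```python
event_to_drive = {"dinner_bell": "hunger"}   # learned event->drive predictions
drive_to_action = {"hunger": "eat"}          # learned drive->action responses

def anticipate(event):
    # If the event predicts a drive with a known response, trigger the
    # response before the drive itself occurs.
    drive = event_to_drive.get(event)
    return drive_to_action.get(drive) if drive else None

print(anticipate("dinner_bell"))  # -> eat, triggered before hunger arises
```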