TITLE: Introducing Constructive Contextual Reinforcement Learning [CCRL]


AUTHOR: gdh
COMMENT: jh
VERSION: V 1.0.1
FIRST: 30.11.98 18:35
LAST: 11.12.98 17:25

References:

KAELBLING, Leslie Pack/ LITTMAN, Michael L./ MOORE, Andrew W. [1996] Reinforcement Learning: A Survey. In: Journal of Artificial Intelligence Research 4 (1996), pp. 237-285.

http://www.cs.brown.edu/people/lpk/rl-survey.html


CONTENT:

  1. LTD and Learning
  2. Reinforcement Learning
  3. Problems with the Reinforcement Learning Paradigm

    3.1 The Non-Behavioral Learning Concept
    3.2 External Reward

  4. Constructive Contextual Reinforcement Learning
  5. Behavioral and Cognitive Learning Concepts

    5.1 Behavioral Learning Concepts
    5.2 Learning with Neural Nets based on the INM-Neuron
    5.3 Cognitive Learning Concepts


  1. LTD and Learning

    LTD-R centers its focus on self-learning agents - human as well as transhuman agents - in the domain of knowledge management.

    Learning is seen here primarily from a behavioral point of view, but in parallel also from a physiological point of view with a phenomenological perspective as a possible third dimension.

    To realize this task one has to set up, step by step, a formal model of the environment in which learning is assumed to happen, localized in agents interacting with this environment. Besides this, a formal model of the internal processes of these acting agents has to be elaborated in correlation with the observable interactions and environmental states.

    The general outline of such a formal theory together with methods of measurements and computational models is described in another paper (see: "LTD-R I+II Outline of Research").

    To specify a bit more concretely which kind of learning paradigm we are working with, we will describe a first paradigm which is induced by a discussion of the Reinforcement Learning Paradigm, for which KAELBLING et al. (1996) give an exciting overview.

  2. Reinforcement Learning

    The general outline of the reinforcement paradigm can be represented by the following diagram:


    [Diagram: Reinforcement Learning Paradigm]
    1. The ENVIRONMENT contains states which can be finite or infinite, discrete or continuous.
    2. The states can be changed by STATE TRANSITIONS which are connected to a TRANSITION PROBABILITY.
    3. The states are STATIONARY or not STATIONARY.
    4. There is some 'perception' of actual states and state transitions by an INPUT function, which yields a state description.
    5. There is a special input r called REINFORCEMENT (or REWARD). Reinforcement can happen IMMEDIATELY or DELAYED (general case: r in REAL).
    6. There is an 'output'/ 'response' of the learner to the environment, which is called an ACTION. The number of actions can be finite or infinite. Actions can be DISCRETE or CONTINUOUS. Every action changes the state of the environment and causes a reinforcement value to be sent to the learner.
    7. Typical mappings are the following ones:
      1. Value functions: S --> REAL (associating states with rewards; credit assignment problem)
      2. Reinforcement functions/ Q-functions: S x A --> REAL (associating states and actions with rewards)
      3. Policy: S --> A (mapping states into actions)
      4. Deterministic transitions: S x A --> S
      5. State Transition probabilities: S x A x S --> [0,1]
      6. Experiences: tuples <s, a, s', r> in S x A x S x REAL
    8. A MODEL consists of knowledge of the state transition probability function T(s,a,s') and the reinforcement function R(s,a).
    9. The GOAL of the learner is to maximize the long-run measure of reinforcement.
    10. To achieve the goal the learner uses an INTERNAL MODEL with a horizon (number of steps 'looked ahead'= finite/ infinite)
    11. The most commonly used models to compute the EXPECTED REWARD are the following (a small computational sketch follows this list):
      1. finite-horizon model: E(SUM(i=0, h)(r_i)) /* summation of the reinforcement values r_i in the next h steps */
      2. infinite-horizon discounted model: E(SUM(i=0, ∞)(gamma^i * r_i)) (0 < gamma < 1) /* the farther a step lies in the future, the more strongly its reinforcement value is weighted down by the discount factor gamma */
      3. average-reward model (also: gain-optimal policy): lim(h -> ∞) E(1/h * SUM(i=0, h)(r_i)) /* the arithmetic mean of all reinforcement values extended into the future. To remain sensitive to an extra initial reward it is often refined to the bias-optimal model, which maximizes the long-run average but is also sensitive to extra reward gained in the beginning. */
    12. MEASURING LEARNING PERFORMANCE: (i) Eventual convergence to optimal behavior (useless in practical terms); (ii.1) Speed of convergence to optimality (ill-defined on account of the asymptotic character of the result); (ii.2) Speed of convergence to near-optimality (this begs the definition of how near to optimality is sufficient); (ii.3) Level of performance after a given time (this begs the definition of what an appropriate time is). /* All measures relying on 'speed' share the general weakness that an algorithm which accepts large penalties along the way can still receive good values. */ (iii) Regret: measures the expected decrease in reward compared to behaving optimally from the very beginning; it penalizes mistakes wherever they occur during the run /* results are hard to obtain */.
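    The following sketch makes the three expected-reward models of item 11 concrete by computing them over one sampled trajectory of reinforcement values r_i (all names and numbers are invented for illustration; the expectations E(...) would be estimated by averaging over many such trajectories):

    def finite_horizon_return(rewards, h):
        """Finite-horizon model: SUM(i=0, h)(r_i), the reinforcement of the next h steps."""
        return sum(rewards[: h + 1])

    def discounted_return(rewards, gamma):
        """Infinite-horizon discounted model: SUM(i=0, INF)(gamma^i * r_i), truncated to the trajectory."""
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    def average_reward(rewards):
        """Average-reward model: 1/h * SUM(i=0, h)(r_i) for a large finite h."""
        return sum(rewards) / len(rewards)

    trajectory = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]     # hypothetical reinforcement values
    print(finite_horizon_return(trajectory, h=3))    # 1.0
    print(discounted_return(trajectory, gamma=0.9))  # 0.81 + 0.6561 + 0.59049 = 2.05659
    print(average_reward(trajectory))                # 0.5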
  3. Problems with the Reinforcement Learning Paradigm

    Especially two problems with the reinforcement paradigm shall be mentioned here. One is related to the methodological problem of using a learning concept which is not strictly 'behavioral'. The other is related to the assumption that the reward comes from the environment.

    3.1 The Non-Behavioral Learning Concept

      The non-behavioral character of the central learning concept within reinforcement learning is revealed by the following facts: within reinforcement learning the term 'learning' is bound to the goal of maximizing the amount of reward 'in the long run'. This goal of 'maximizing reward' is connected to different procedures for obtaining this maximum. These goal-obtaining procedures are a mixture of exploring alternatives, evaluating alternatives, and selecting a next step, repeated several times over a certain period of time. Procedures which lead in this way to the optimal value of reward are called 'learning procedures'.

      From a methodological point of view, a concept like the 'optimal value of reward' with regard to a certain individual system induces some problems.

      The 'maximum reward' is not a behavioral concept! It is related to some internal states of an individual system. One consequence of this is that the same environment can be seen quite differently depending on the individual settings of learning. Although this 'individualization' may be useful for certain theoretical investigations, it is not useful with regard to the challenge of defining a 'learning task' without relying on special conditions of the learner. What should count here is the fact that a certain learner L is able to solve a certain task within a finite period of time, without being forced to include the individual inner states of the learner L.

      This is one reason why in LTD the term 'learning' is primarily not bound to the concept of reward but to the concept of a 'task' and to the 'solution of a task' within a finite period of time.

      But even if one explicitly introduces a behavioral concept of learning, one has to clarify what the methodological status of the 'internal states' of a learner can be. As long as one deals only with formal systems or algorithms as such this poses no problem, but the moment one applies these formal concepts to empirical entities the conditions change.

      In the realm of empirical systems there are three main possibilities to deal with 'internal states' of a system: (i) from a behavioral (S-R) point of view you are 'guessing' internal states by introducing 'theoretical terms' in your theory; (ii) from a physiological or mechanical (N) point of view you are looking into the system; and (iii) from some 'conscious' or 'inward' point of view (P) you have some 'direct' experience of phenomena which you can try to articulate with the means of a language (additionally one has to take into account the implicit restrictions of 'inward views' of neural systems).

      Within LTD we will prefer case (ii). This implies that one has to correlate the behavioral learning concept explicitly with the mechanical/ physiological concept. In the 'mainstream' reinforcement paradigm there is no discussion of this topic.

    3.2 External Reward

    The assumption that reward comes from the environment to a learning system seems to us highly implausible as far as one is dealing with biological learning systems. Thus within LTD we will locate the source of reward in the system itself!

  4. Constructive Contextual Reinforcement Learning

    The main idea behind CCRL is the assumption that learning presupposes at least the following components:

     

    1. A learner L endowed with special 'inner' states which represent for the learner a continuum of 'pre-established meaning' related to 'good/bad'
    2. A learning task T as part of the environment E of the learner L
    3. Distinguishable stimuli S 'directed' from the environment E to the learner L
    4. Distinguishable responses R 'directed' from the learner L to the environment E
      These basic assumptions imply some additional assumptions (a schematic sketch of all components follows below):

    5. The observer of these learning processes is 'outside' of these processes; he/she can 'observe' the behavior of the learner (S-R perspective) and/or the internal states (N perspective (physiological, mechanical, ...)).
    6. The assumption of a 'direction' of stimuli and responses presupposes that there are certain 'causal' mechanisms which are 'responsible' for the occurrence of stimuli and responses.
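    As a minimal sketch of these components (all names, types, and numeric choices are purely hypothetical), the decisive point of CCRL can be written down directly: the reinforcement value is produced by the learner's own 'good/bad' states, not handed over by the environment.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Environment:
        state: float = 0.0           # one environmental state, reduced to a single number
        def emit_stimulus(self) -> float:
            return self.state        # stimulus S 'directed' from E to the learner
        def receive_response(self, action: float) -> None:
            self.state += action     # response R 'directed' from the learner to E

    @dataclass
    class Learner:
        # 'inner' states representing a continuum of 'pre-established meaning'
        # related to 'good/bad'; here compressed into one preferred value.
        preferred_value: float = 1.0
        reward_trace: List[float] = field(default_factory=list)

        def internal_reward(self, stimulus: float) -> float:
            # Reward is constructed inside the learner, relative to its own states.
            return -abs(stimulus - self.preferred_value)

        def act(self, stimulus: float) -> float:
            self.reward_trace.append(self.internal_reward(stimulus))
            # a crude response: move the environment toward the preferred value
            return 0.5 * (self.preferred_value - stimulus)

    env, learner = Environment(), Learner()
    for _ in range(5):                 # an external observer's (S-R) view of the loop
        s = env.emit_stimulus()
        env.receive_response(learner.act(s))
    print(learner.reward_trace)        # internally generated rewards approach 0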

    The assumption that reinforcement is not a 'given' part of the environment has some resemblance to so-called constructive epistemology. In constructive epistemology it is assumed that the view of the world is an internal construction of the system, which builds this internal view out of the states which are internally available.

     

    In this sense one can call this approach not only constructive but also contextual; the learner cannot be defined without an explicit account of certain aspects of the environment (the 'task') which work as a kind of 'index' to classify the learner.

     

    It is an interesting question to what extent reward can be 'connected' to states which are only 'indirectly' linked to the 'original' reward states.

     

    Some more concrete details on the above assumptions (a schematic perception-cognition-action sketch follows this list):

      1. The environment E is -from a practical point of view- NOT STATIONARY, NOT FINITE, and its states are CONTINUOUS.
      2. The learner L is a DISTINGUISHABLE (biological or technical) BODY in the environment. The basic units of the body are finitely many distinguishable states.
      3. The directed stimuli represent an INCOMPLETE SENSORY PERCEPTION as one of the properties of the body, which maps a finite portion of the continuous environmental states into a FINITE SET OF CONTINUOUS VALUES (INPUT VECTOR).
      4. REINFORCEMENT is distinguished as FITNESS in an evolutionary perspective and as LOCAL REWARD connected to some internal states. Fitness is defined as a property bound to the body in relation to an environment and the dynamics of a population. From the perspective of an individual learner only the local reward will be considered. Local reward is assumed to be caused by some DRIVING INTERNAL STATES (REWARD STATES) OF THE LEARNER as a property of the body (this induces one main difference to the normal reinforcement paradigm). Local reward can be DELAYED, but is usually connected to some causal mechanisms which generate the REWARDING INTERNAL STATES.
      5. It is assumed that besides INPUT and OUTPUT states there is a finite set of COGNITIVE STATES. There is furthermore a COGNITIVE PERCEPTION which maps sensory perceptions (= INPUT STATES) and REWARD STATES into a finite set of COGNITIVE STATES. Cognitive perception can include modes of ABSTRACTION which can be modified by other sources simultaneously with the sensory perception (abstraction includes e.g. the construction of classes, the establishment of similarity relations, subcategorization, etc.).
      6. There is a finite set of COGNITIVE OPERATIONS which can operate on the COGNITIVE STATES and change these states.
      7. The RESPONSES (= ACTIONS) of the learner are represented as a FINITE SET OF CONTINUOUS ACTIONS occurring in an environment. These actions depend on another FINITE SET OF INTERNAL CONTINUOUS VALUES representing MOTOR VALUES (= OUTPUT STATES) modulated by the COGNITIVE SYSTEM.
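    The following fragment is a deliberately simplistic rendering of assumptions 1-7 as one perception-cognition-action cycle (the naming and every concrete mapping are our own choices, not a committed design); its only purpose is to make the direction of the data flow explicit:

    from typing import List

    def sensory_perception(env_states: List[float], window: int = 3) -> List[float]:
        """Incomplete: only a finite portion of the continuous environmental
        states is mapped into a finite input vector."""
        return env_states[:window]

    def internal_reward(input_vector: List[float], drive_level: float) -> float:
        """Local reward caused by a driving internal state of the learner (here a
        single 'drive_level'), not delivered by the environment."""
        return drive_level - sum(abs(x) for x in input_vector)

    def cognitive_perception(input_vector: List[float], reward: float) -> List[float]:
        """Maps input states and reward states into a finite set of cognitive
        states; a crude 'abstraction' keeps only the sign of each value."""
        return [1.0 if x > 0 else -1.0 for x in input_vector + [reward]]

    def cognitive_operation(cognitive_states: List[float]) -> List[float]:
        """A cognitive operation changing the cognitive states (here: inversion)."""
        return [-c for c in cognitive_states]

    def motor_output(cognitive_states: List[float]) -> List[float]:
        """A finite set of continuous motor values modulated by the cognitive system."""
        return [0.1 * c for c in cognitive_states]

    env_states = [0.2, -0.7, 0.4, 0.9, -0.1]   # continuous environmental states
    inp = sensory_perception(env_states)
    r = internal_reward(inp, drive_level=1.0)
    cog = cognitive_operation(cognitive_perception(inp, r))
    print(motor_output(cog))                    # actions returned to the environment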

    From these assumptions we get the following simplified structural descriptions (leaving out details of space, time, aggregation, etc.):


    [Diagram: INM Learning Paradigm]
  5. Behavioral and Cognitive Learning Concepts

    As said before, LTD follows a twofold strategy: first, establishing a behavioral learning concept without any relationship to the inner states of the learner, and then establishing a cognitive (mechanistic/ physiological) learning concept which will explicitly be related to the behavioral concept.

    5.1 Behavioral Learning Concepts

    In a first step we will characterize the concepts 'task' and 'solution of a task' within the behavioral paradigm.

    The main idea is as follows: a 'task' will be characterized as a realizable sequence of configurations of environmental states with a 'key sequence', an optional 'transfer sequence', and a 'solution sequence'. The 'transitions' between the parts of any of these sequences can either be triggered by 'actions' of the learner (directly or indirectly by 'instrumental actions') or by general laws/ rules governing the behavior of environmental states.
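    One possible encoding of this task concept (purely illustrative; the names and the toy 'door' example are invented) could look as follows:

    from dataclasses import dataclass
    from typing import List, Tuple

    Configuration = Tuple[str, ...]        # one configuration of environmental states

    @dataclass
    class Task:
        key: List[Configuration]           # the sequence that 'poses' the task
        transfer: List[Configuration]      # optional intermediate configurations
        solution: List[Configuration]      # reaching this sequence solves the task

    def is_solved(observed: List[Configuration], task: Task) -> bool:
        """A purely behavioral criterion: the observed environmental configurations
        end with the task's solution sequence, without reference to inner states."""
        n = len(task.solution)
        return len(observed) >= n and observed[-n:] == task.solution

    door_task = Task(key=[("door", "closed")],
                     transfer=[("door", "open")],
                     solution=[("learner", "outside")])
    history = [("door", "closed"), ("door", "open"), ("learner", "outside")]
    print(is_solved(history, door_task))   # True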


    [Diagram: Reinforcing Learning Paradigm]

    As long as it is not possible to characterize tasks of this kind independently of any inner states of a learner, the task does not exist from the point of view of a learning theory.

     

    What has to be thought of as an 'environmental state' depends on the kind of environment one wants to work with.

     

    A task will be called a serious task if the potential behavior of a learner is sufficiently 'powerful' to produce all those actions which are necessary to transform a key sequence after finitely many time intervals into a 'solution sequence'.

     

    Because every serious task can only be a highly idealized description of a small fragment of a 'real' environment, it is clear that a 'real' learner in a 'real' environment can use the learned serious tasks only as a starting point for a more advanced generalization about the presupposed 'bigger' environment.

     

    Different from the description of the learning task as such, one can also describe the behavior of a learner as a sequence of interactions with the environment and the task, i.e. we will have a learning sequence for learner L1 consisting of learning units which have the format:

    <Learner, time period, environmental states, position of learner, actual task partition, action(s) of learner, new environmental states, new position of learner, new actual task partition>.

     

    Because a learning sequence can contain 'implicit' loops, the learning space can also be represented as a directed cyclic graph.
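    A small sketch of this (again only illustrative; the record fields mirror the format above, everything else is an invented choice) shows a learning unit and the folding of a learning sequence into such a directed graph, in which implicit loops show up as cycles:

    from collections import defaultdict
    from dataclasses import dataclass
    from typing import Dict, Set, Tuple

    @dataclass(frozen=True)
    class LearningUnit:
        learner: str
        time_period: int
        env_states: str
        position: str
        task_partition: str
        actions: Tuple[str, ...]
        new_env_states: str
        new_position: str
        new_task_partition: str

    Node = Tuple[str, str, str]    # (environmental states, position, task partition)

    def learning_graph(sequence) -> Dict[Node, Set[Node]]:
        """Folds a learning sequence into a directed graph; repeated situations
        make the 'implicit' loops of the sequence explicit as cycles."""
        graph: Dict[Node, Set[Node]] = defaultdict(set)
        for u in sequence:
            src = (u.env_states, u.position, u.task_partition)
            dst = (u.new_env_states, u.new_position, u.new_task_partition)
            graph[src].add(dst)
        return graph

    seq = [
        LearningUnit("L1", 0, "E0", "p0", "key", ("move",), "E1", "p1", "key"),
        LearningUnit("L1", 1, "E1", "p1", "key", ("turn",), "E0", "p0", "key"),  # a loop
        LearningUnit("L1", 2, "E0", "p0", "key", ("push",), "E2", "p2", "solution"),
    ]
    print(dict(learning_graph(seq)))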


    [Diagram: Learning Graph]

    5.2 Learning with Neural Nets based on the INM Neuron

    The main idea of a learner endowed with a neural net is shown in the following diagram:


    Reinforcing Learning Paradigm

    In a simple case there is a mapping between the states of a learner and dedicated neural nets which are interconnected.

    For every neuron in a net we assume that the following information is available:

    <identifier, value dimension, general position and orientation, dimension, {membrane positions}, {synapse positions with orientations}>

    The idea behind this is that a neural net as part of a body in a real environment has to cope with space and different types of stimuli. This presupposes that the body of the learner has a defined shape, an actual position, and an actual orientation. Relative to the body's geometry it is then possible to determine the position of a neuron within a 'real' environment. Besides this, the information about the value dimension allows us to determine which kind of environmental stimuli can trigger a certain sensory neuron at some time interval.
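    A possible data layout for the per-neuron information listed above, together with such a triggering check, could look as follows (all names and the triggering rule are assumptions for illustration, not a fixed part of the INM model):

    import math
    from dataclasses import dataclass
    from typing import List, Tuple

    Point = Tuple[float, float, float]

    @dataclass
    class INMNeuron:
        identifier: str
        value_dimension: str                           # e.g. 'light', 'pressure', 'chemical'
        position: Point                                # general position relative to the body geometry
        orientation: Point                             # direction the neuron 'faces'
        extension: float                               # spatial dimension of the neuron
        membrane_positions: List[Point]
        synapse_positions: List[Tuple[Point, Point]]   # (position, orientation) pairs

    def can_trigger(neuron: INMNeuron, stimulus_dimension: str, stimulus_pos: Point) -> bool:
        """A stimulus can trigger a sensory neuron if it carries the neuron's value
        dimension and occurs within the neuron's spatial extension."""
        return (stimulus_dimension == neuron.value_dimension
                and math.dist(neuron.position, stimulus_pos) <= neuron.extension)

    n1 = INMNeuron("n1", "light", (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.5, [], [])
    print(can_trigger(n1, "light", (0.1, 0.0, 0.0)))      # True
    print(can_trigger(n1, "pressure", (0.1, 0.0, 0.0)))   # False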

    5.3 Cognitive Learning Concepts

We can distinguish two versions of cognitive learning concepts: 'model-centered' and 'correlational'.


[Diagram: Cognitive Learning]

The correlational version of the cognitive learning concept 'only' correlates task-specific activities with those neural activities which are specific for these activities, compared to those neural activities which happen 'in any case'.
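A minimal sketch of this correlational reading (our own construction, with invented activity values): neural activities recorded during a task are contrasted with the activities happening 'in any case', and only the surplus is attributed to the task.

    from typing import Dict

    def task_specific_activity(during_task: Dict[str, float],
                               baseline: Dict[str, float]) -> Dict[str, float]:
        """Per neuron (or neuron group): activity during the task minus the
        activity observed in any case; only the positive surplus is kept."""
        units = set(during_task) | set(baseline)
        return {u: max(0.0, during_task.get(u, 0.0) - baseline.get(u, 0.0)) for u in units}

    baseline    = {"n1": 0.2, "n2": 0.5, "n3": 0.1}
    during_task = {"n1": 0.9, "n2": 0.5, "n4": 0.3}
    print(task_specific_activity(during_task, baseline))
    # e.g. {'n1': 0.7, 'n2': 0.0, 'n3': 0.0, 'n4': 0.3} (key order may vary)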

The model-centered version assumes that it is possible to identify within the activities of a neural net such activities which can be 'understood' as representing a 'model' of the environment. Insofar as this is possible, one can compare this model with the environment 'as such'. The 'content' of learning would then be measured with regard to the 'richness' and 'adequacy' of the model compared to the 'real' environment.

In 'the long run' the LTD learning theory is intended to work with such a cognitive learning concept.