Cartpole

Description of the task : ``Cart-Pole''

The state space :

The Cartpole is defined by the 4-dimensional state x=(theta, y, theta_dot, y_dot)
where theta is the the angles of the pole, y the position of the cart (see figure) and (theta_dot, y_dot) are their respective velocities.

The control space :

The control u is a 1-dimensional variable that takes a
finite number of possible values between -max_u and +max_u.
It represents the thrust applied to the cart.

The state dynamics :

The state dynamics is defined by the differential equation :
dx
-- = f(x,u)
dt
in the deterministic case, and by the stochastic diff. eq. :
dx = f(x,u) dt + s(x,u) dw
in the stochastic case. Here dw is a Wiener process.
(The formulation of the stochasticity is explained is the
function Cartpole_noise).

The reinforcement functions :

The current and terminal reinforcement functions are described
in the functions :

Cartpole_current_reinf

Cartpole_terminal_reinf

Code in C :

/* Definition of some constants and basic functions */
#define CART_MASS 1.0
#define POLE_MASS 0.1
#define POLE_LENGTH 1.0

/* This is the state dynamics.
Inputs :

tsk is the task (here Cartpole)
state is the current (4d here) state
action is the control (+ or - max_u)

Output :

f is the state dynamics vector (4d here)

*/
void Cartpole_f(task *tsk, double *state, double action, double *f)
{
double x_double_dot, theta_double_dot;
double theta, theta_dot, u;

theta = state[0];
theta_dot = state[2];
u = action;

theta_double_dot = (((CART_MASS + POLE_MASS) * 9.8 * sin(theta) -
           (u + POLE_MASS * POLE_LENGTH * theta_dot * sin(theta))*cos(theta)) /
         ((4.0 / 3.0) * (CART_MASS + POLE_MASS) * POLE_LENGTH -
          (POLE_MASS * POLE_LENGTH * pow(cos(theta), 2))));

x_double_dot = ((pow(theta_dot, 2) * sin(theta) - theta_double_dot *sin(theta))
* POLE_MASS * POLE_LENGTH + u) / (POLE_MASS + CART_MASS);

f[0]=theta_dot;
f[1]=state[3];
f[2]=theta_double_dot;
f[3]=x_double_dot;
}

/* This is the stochastic part.
The stochastic differential equation :
dx = f(x,u) dt + s(x,u) dw
includes 2 part :

the local drift : f(x,u) dt which is the deterministic part
the noise : s(x,u) dw.

Let the matrix a=s.s' (where ' means the transpose)
Then a is a 4x4 symetric positive matrix, thus there exists a set
of 4 orthogonal eigenvectors e1, e2, e3 and e4 with eigenvalues l1 l2, l3 and l4.
This function takes as input :

the (4d) state and action and

and returns the output :

the corresponding eigenvalues and eigenvectors.

For example,
- if there is no noise (deterministic process) just return l1 = l2 = l3 = l4 = 0.
*/
void Cartpole_noise(task *tsk, double *state, double action, double *eig_val, double **eig_vect)
{
/* In this example, the noise is zero (does not depend on
the state or the action) and is as follow :
the eigenvalues and the eigenvectors are :
l1= 0.0 for e1=( 1 0 0 0)
l2= 0.0 for e2=( 0 1 0 0)
l3= 0.0 for e3=( 0 0 1 0 )
l4= 0.0 for e4=( 0 0 0 1 )
*/
eig_val[0]=0.0;
eig_val[1]=0.0;
eig_val[2]=0.0;
eig_val[3]=0.0;

eig_vect[0][0]=1;
eig_vect[1][0]=0;
eig_vect[2][0]=0;
eig_vect[3][0]=0;

eig_vect[0][1]=0;
eig_vect[1][1]=1;
eig_vect[2][1]=0;
eig_vect[3][1]=0;

eig_vect[0][2]=0;
eig_vect[1][2]=0;
eig_vect[2][2]=1;
eig_vect[3][2]=0;

eig_vect[0][3]=0;
eig_vect[1][3]=0;
eig_vect[2][3]=0;
eig_vect[3][3]=1;
}

/* The current reinforcement (here 0 everywhere) */
double Cartpole_current_reinf(task *tsk, double *state, double action, double seconds)
{
return 0;
}

/* The terminal reinforcement : this function is called only when the systems exits from the state space.
If the systems has reached the goal then R=+1, otherwise R=-1.
*/
double Cartpole_terminal_reinf(task *tsk, double *state)
{
if (in_goal(tsk, state) return 1.0;
else return -1.0;
}

Some numerical values :

State space :

theta is bounded by : -1.2 , +1.2
y is bounded by : -10, +10
theta_dot is bounded by : -2, +2
y_dot is bounded by : -4, +4

Action space :

there are 2 actions : -12 and +12 (max_u = 12)

The Goal is defined by :

theta = 0 + or - PI/45
y = 4.3 + or - 0.2
theta_dot = 0 + or - 2.0
y_dot = 0 + or - 4.0

The discount factor : gamma = 0.7

Description of the boundaries :

the boundary related to the first dimension (theta) is a cylindrical boundary : exiting from one side leads to re-enter from the opposite side of the state space.
the boundary related to the second dimension (y) is a normal boundary.
the boundary related to the third dimension (theta_dot) is a bounding boundary (the velocity is bounded to stay in the range [-2, +2]).
the boundary related to the third dimension (y_dot) is a bounding boundary (the velocity is bounded to stay in the range [-4, +4]).

Timestep for the integration (Runge Kutta) of the trajectories : 0.005