The Acrobot

Description of the task : ``Acrobot''

The state space :

The Acrobot is defined by the 4-dimensional state x=(theta1, theta2, theta1_dot, theta2_dot)
where (theta1, theta2) are the angles (see figure) and (theta1_dot, theta2_dot) are their velocities.

The control space :

The control u is a 1-dimensional variable that takes a
finite number of possible values between -max_torque and +max_torque.
It represents the torque applied to the elbow.

The state dynamics :

The state dynamics is defined by the differential equation :
dx
-- = f(x,u)
dt
in the deterministic case, and by the stochastic diff. eq. :
dx = f(x,u) dt + s(x,u) dw
in the stochastic case. Here dw is a Wiener process.
(The formulation of the stochasticity is explained is the
function acrobot_noise).

The reinforcement functions :

The current and terminal reinforcement functions are described
in the functions :

acrobot_current_reinf

acrobot_terminal_reinf

Code in C :

/* Definition of some constants and basic functions */
#define max_torque 2.0
#define friction 0.0
#define m1 1.0 /* Mass of shoulder-to-elbow joint */
#define m2 1.0 /* Mass of elbow-to-hand joint */
#define l 1.0 /* Distance from shoulder to elbow AND Distance from elbow to hand */
#define GRAVITY 9.81

/* This is the state dynamics.
Inputs :

tsk is the task (here Acrobot)
state is the current (4d here) state
action is the control (+ or - max_torque)

Output :

f is the state dynamics vector (4d here)

*/
void acrobot_f(task *tsk, double *state, double action, double *f)
{
double p1, v1, p2, v2, t1, t2, a1, a2;

p1 = state[0];
p2 = state[1];
v1 = state[2];
v2 = state[3];
t1 = action;
t2 = 0;

a1 = 0;
a2 = 0;

joint_accns(p1, v1, p2, v2, t1, t1, &a1, &a2);

f[0]=v1;
f[1]=v2;
f[2]=a1;
f[3]=a2;
}

/* This is the stochastic part.
The stochastic differential equation :
dx = f(x,u) dt + s(x,u) dw
includes 2 part :

the local drift : f(x,u) dt which is the deterministic part
the noise : s(x,u) dw.

Let the matrix a=s.s' (where ' means the transpose)
Then a is a 4x4 symetric positive matrix, thus there exists a set
of 4 orthogonal eigenvectors e1, e2, e3 and e4 with eigenvalues l1 l2, l3 and l4.
This function takes as input :

the (4d) state and action and

and returns the output :

the corresponding eigenvalues and eigenvectors.

For example,
- if there is no noise (deterministic process) just return l1 = l2 = l3 = l4 = 0.
*/
void acrobot_noise(task *tsk, double *state, double action, double *eig_val, double **eig_vect)
{
/* In this example, the noise is constant (does not depend on
the state or the action) and is as follow :
the eigenvalues and the eigenvectors are :
l1= 0.02 for e1=( 1 0 0 0)
l2= 0.02 for e2=( 0 1 0 0)
l3= 0.004 for e3=( 0 0 1 0 )
l4= 0.004 for e4=( 0 0 0 1 )
*/
eig_val[0]=0.02;
eig_val[1]=0.02;
eig_val[2]=0.004;
eig_val[3]=0.004;

eig_vect[0][0]=1;
eig_vect[1][0]=0;
eig_vect[2][0]=0;
eig_vect[3][0]=0;

eig_vect[0][1]=0;
eig_vect[1][1]=1;
eig_vect[2][1]=0;
eig_vect[3][1]=0;

eig_vect[0][2]=0;
eig_vect[1][2]=0;
eig_vect[2][2]=1;
eig_vect[3][2]=0;

eig_vect[0][3]=0;
eig_vect[1][3]=0;
eig_vect[2][3]=0;
eig_vect[3][3]=1;
}

/* The current reinforcement (here 0 everywhere) */
double acrobot_current_reinf(task *tsk, double *state, double action, double seconds)
{
return 0;
}

/* The terminal reinforcement : this function is called only when the systems reached the goal
*/
double acrobot_terminal_reinf(task *tsk, double *state)
{
return 1.0;
}

/* Now here are some internal functions */
/*
pi = angle of ith joint
vi = angular velocity of ith joint
ti = torque applied to ith joint
ai = angular acceleration of ith joint

Fu et al's robot book gives that

t = D a + h + k

where t , a are vectors, of torques and accns respectively.
D IS A SYMMETRIC 2X2 MATRIX
h and k are vectors
*/
void joint_accns (double p1, double v1, double p2, double v2, double t1, double t2, double *res_a1, double *res_a2)
{
double c1 = cos(p1);
double c2 = cos(p2);
double c12 = cos(p1 + p2);
double s2 = sin(p2);
double lsqd = l * l;
double m2_lsqd = m2 * lsqd;
double m2gl = m2 * GRAVITY * l;
double d11 = 1.0/3 * m1 * lsqd + (4.0/3 + c2) * m2_lsqd;
double d12 = (1.0/3 + 1.0/2 * c2) * m2_lsqd;
double d22 = 1.0/3 * m2_lsqd;
double h1 = -m2_lsqd * s2 * v2 * ( 1.0/2 * v2 + v1);
double h2 = 1.0/2 * s2 * v1 * v1 * m2_lsqd;
double k2 = 1.0/2 * m2gl * c12;
double k1 = 1.0/2 * m1 * GRAVITY * l * c1 + k2 + m2gl * c1;

double y1 = t1 - h1 - k1;
double y2 = t2 - h2 - k2;

double b = d12 * d12 - d11 * d22;

if ( fabs(b) < 1e-20 ) my_error("tarm.c singularity");

*res_a2 = (d12 * y1 - d11 * y2) / b;
*res_a1 = (d12 * y2 - d22 * y1) / b;
}

/* Gives the x and y coordinates of the elbow in (*j1_x,*j1_y)
and the hand in (*j2_x,*j2_y). So a simple stick figure drawing of
the arm is just two lines, the first from (0,0) to (j1x,j1y). The
second from (j1x,j1y) to (j2x,j2y) */
void arm_kinematics (double p1, double p2, double *j1_x, double *j1_y, double *j2_x, double *j2_y)
{
double c1 = cos(p1);
double c12 = cos(p1 + p2);
double s1 = sin(p1);
double s12 = sin(p1 + p2);

*j1_x = l * c1;
*j1_y = l * s1;
*j2_x = *j1_x + l * c12;
*j2_y = *j1_y + l * s12;
}

Some numerical values :

State space :

theta1 is bounded by : -3.PI/2, +PI/2
theta2 is bounded by : -PI, +PI
theta1_dot is bounded by : -10, +10
theta2_dot is bounded by : -10, +10

Action space :

there are 2 actions : -2 and +2 (max_torque = 2)

The Goal (red area in the figure) is defined by :

theta1 = PI/2 + or - PI/16
theta2 = 0 + or - PI/16
theta1_dot = 0 + or - 8.0
theta2_dot = 0 + or - 8.0

The discount factor : gamma = 0.93

Description of the boundaries :

the boundary related to the first dimension (theta1) is a cylindrical boundary : exiting from one side leads to re-enter from the opposite side of the state space.
the boundary related to the second dimension (theta2) is a cylindrical boundary.
the boundary related to the third dimension (theta1_dot) is a bounding boundary (the velocity is bounded to stay in the range [-10, +10]).
the boundary related to the third dimension (theta1_dot) is a bounding boundary (the velocity is bounded to stay in the range [-10, +10]).

Timestep for the integration (Runge Kutta) of the trajectories : 0.01

Timeout : the system has to reach the goal within 40 seconds, otherwise it is automatically stopped and the global reinforcement is 0.