regarding #2 #3

abodh · 2019-04-25T17:26:38Z

    # The gradient for the loss function
    grad_val_ph = tf.placeholder(tf.float32, shape=dis.input_reward.get_shape())
    grad_dis = dis_copy(dis, grad_val_ph)

    # The generator-discriminator for the loss function
    gen_dis = dis_copy(dis, tf.reduce_max(gen.output, axis=1))

    # Loss functions (parentheses keep the multi-line expression valid Python)
    dis_loss = (tf.reduce_mean(tf.squeeze(gen_dis.output))
                - tf.reduce_mean(tf.squeeze(dis.output))
                + lambda_ * tf.reduce_mean(tf.square(tf.gradients(grad_dis.output, grad_val_ph)[0] - 1)))
    gen_loss = tf.reduce_mean(-tf.squeeze(gen_dis.output))

    # Optimization
    optim = optimizer(learning_rate=learning_rate)
    dis_min_op = optim.minimize(dis_loss, var_list=dis.trainable_variables)
    gen_min_op = optim.minimize(gen_loss, var_list=gen.trainable_variables)

Hello Aaron,

Can you show how the code above relates to the original loss formula used in the paper?
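For reference, here is the formula I am trying to match the code against. This is an assumption on my part: because of the lambda_ gradient term, I am guessing the loss follows the WGAN-GP (Wasserstein GAN with gradient penalty) objective, where D is the discriminator, \tilde{x} is a generated sample, x is a real sample, and \hat{x} is the point at which the gradient penalty is evaluated.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]

L_G = -\mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
```

If that guess is right, gen_dis.output would play the role of D(\tilde{x}), dis.output the role of D(x), and grad_dis.output the role of D(\hat{x}). One thing I cannot match up: the code squares (gradient - 1) element-wise and averages it, while the formula above takes the norm of the gradient before subtracting 1.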

I also have another question:

    action_results = sess.run(gen.output, feed_dict={
        gen.input_state: np.array([last_obs]),
        gen.input_seed: np.array([gen_seed]),
    })[0]
    optimal_action = np.argmax(action_results)

Why are you using action_results[0]? As you said, the generator generates the next state and reward. Which one is the state and which one is the reward? Since this is the cart-pole problem, shouldn't the state have 4 values? I know these questions might seem stupid, but I am having a hard time understanding this.
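To make the question concrete, here is a minimal sketch of how I currently read that snippet. It assumes gen.output has shape (batch_size, num_actions), which I am inferring from the tf.reduce_max(gen.output, axis=1) call above. The array batch_output and its values are hypothetical stand-ins for what sess.run returns; they are not taken from the repository.

```python
import numpy as np

# Hypothetical stand-in for sess.run(gen.output, ...): one row per input in
# the batch, one column per action (assumed shape: (batch_size, num_actions)).
# The feed_dict wraps last_obs in a list, so the batch size here is 1.
batch_output = np.array([[0.2, 0.7]])  # made-up values for CartPole's 2 actions

# The trailing [0] drops the batch dimension and leaves one value per action.
action_results = batch_output[0]

# argmax then picks the index of the action with the largest value.
optimal_action = int(np.argmax(action_results))
```

Under that reading, the [0] only selects the single batch entry, and the result would hold one value per action rather than the 4-value state. Is that right?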

Thanks,
Abodh
