A Minimal DBC Implementation
In this tutorial, we’ll explore how to implement a basic Diffusion Behavior Clone (DBC) using CleanDiffuser. DBC is an imitation learning algorithm that aims to replicate behaviors from an offline dataset. It leverages a diffusion model to generate samples from the policy distribution \pi_\theta(a|s).
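Concretely, DBC fits this conditional distribution with a standard denoising (score-matching) objective over state-action pairs from the dataset. The sketch below is only illustrative: the exact weighting depends on the noise schedule, and (as discussed in Section 3) the network can equivalently be trained to predict the clean action instead of the noise.
\mathcal{L}(\theta) = \mathbb{E}_{(s,a)\sim\mathcal{D},\; t,\; \epsilon\sim\mathcal{N}(0,I)} \left[ \left\| \epsilon_\theta(\alpha_t a + \sigma_t \epsilon,\; t,\; s) - \epsilon \right\|^2 \right]
where \alpha_t and \sigma_t are the noise-schedule coefficients of the forward (VP) diffusion process, and \epsilon_\theta is the denoising network conditioned on the state s.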
1 Setting Up the Environment
Since imitation learning algorithms simply imitate the behaviors in the offline dataset, we cannot expect DBC to perform well on low-quality datasets. In this tutorial, we use kitchen-complete-v0 from D4RL as our test environment. In this environment, the agent needs to complete various kitchen tasks within a limited time, such as moving the kettle, opening the microwave, and so on. The complete dataset provides many demonstrations that fully complete the tasks.
import d4rl
import gym
# --------------- Setting Up the Environment ---------------
env = gym.make("kitchen-complete-v0")
dataset = d4rl.qlearning_dataset(env)
obs_dim, act_dim = dataset['observations'].shape[-1], dataset['actions'].shape[-1]
size = len(dataset['observations'])
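As an optional sanity check (not part of the original script), it can be helpful to confirm the dataset dimensions and the action range; the roughly ±1 action bounds are what motivate the x_max/x_min clipping used when building the actor in Section 3.
# Optional sanity check (a sketch): inspect dataset dimensions and action range.
print("obs_dim:", obs_dim, "act_dim:", act_dim, "transitions:", size)
print("action range:", dataset['actions'].min(), dataset['actions'].max())  # expected to lie roughly in [-1, 1]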
2 Choosing the Network Architecture
If we consider diffusion models as the “soul”, then neural networks serve as their “shell”. The probability distribution we aim to fit is an action distribution conditioned on the state. To begin, we require an nn_condition for preprocessing the condition, i.e., the states. We utilize the PearceObsCondition implemented in CleanDiffuser to encode the input observation sequence (a single observation if historical observations are not used as context) into features of shape (batch_size, To*emb_dim), which are then fed into nn_diffusion. The nn_diffusion is employed to estimate the unknown terms in the diffusion reverse process. In this tutorial, we utilize the PearceMlp implemented in CleanDiffuser to predict the scaled score function. In the implementation, we only need to import these two classes and initialize them; both inherit from PyTorch’s nn.Module.
from cleandiffuser.nn_condition import PearceObsCondition
from cleandiffuser.nn_diffusion import PearceMlp
# --------------- Network Architecture -----------------
nn_diffusion = PearceMlp(act_dim, To=1, emb_dim=64, hidden_dim=256, timestep_emb_type="positional")
# (bs, act_dim), (bs, ), (bs, To * emb_dim) -> (bs, act_dim)
nn_condition = PearceObsCondition(obs_dim, emb_dim=64, flatten=True, dropout=0.0)
# (bs, To, obs_dim) -> (bs, To * emb_dim)
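To make the shape comments above concrete, here is a rough check that chains the two modules on dummy tensors. This is only a sketch: it assumes the forward calls simply follow the input/output shapes noted in the comments above.
import torch
# A sketch of the shape flow, assuming the forward calls follow the shape comments above.
dummy_obs = torch.randn((4, 1, obs_dim))        # (bs, To, obs_dim) with To=1
dummy_act = torch.randn((4, act_dim))           # (bs, act_dim)
dummy_t = torch.randint(0, 5, (4,)).float()     # (bs,) diffusion timesteps
feat = nn_condition(dummy_obs)                  # -> (4, To * emb_dim) = (4, 64)
out = nn_diffusion(dummy_act, dummy_t, feat)    # -> (4, act_dim)
print(feat.shape, out.shape)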
3 Crafting the Diffusion Actor
We choose DiscreteDiffusionSDE, which optimizes a score-matching loss to learn the score function of the VPSDE and discretizes the time interval of the diffusion process into a finite number of timesteps. During sampling, we can choose any number of sampling steps greater than 1 and no greater than diffusion_steps, and we can select from a range of available solvers. When instantiating this class, we also define several other parameters. Setting predict_noise=False instructs the NN to directly predict denoised actions rather than noise. optim_params overrides the default creation parameters of the optimizer. x_max and x_min clip the range of generated data during the sampling process to reduce out-of-distribution sampling.
import torch
from cleandiffuser.diffusion.diffusionsde import DiscreteDiffusionSDE
# --------------- Diffusion Model Actor --------------------
device = "cuda:0"  # the device used throughout the tutorial; use "cpu" if no GPU is available
actor = DiscreteDiffusionSDE(
    nn_diffusion, nn_condition, predict_noise=False, optim_params={"lr": 3e-4},
    x_max=+1. * torch.ones((1, act_dim)),
    x_min=-1. * torch.ones((1, act_dim)),
    diffusion_steps=5, ema_rate=0.9999, device=device)
4 Training and Evaluation
Generally, training the diffusion model only requires calling the update method, providing the raw data and the corresponding generation conditions for a single gradient update (the condition argument can be omitted when it is not needed). During sampling, we simply call the sample method to obtain the generated data from the diffusion model, along with some logs produced during the process.
import os
import numpy as np
# --------------- Training -------------------
actor.train()
avg_loss = 0.
# train for 100,000 gradient steps (in practice, this is not enough for a fully trained model)
for t in range(100000):
    # sample a batch
    idx = np.random.randint(0, size, (256,))
    obs = torch.tensor(dataset['observations'][idx], device=device).float()
    act = torch.tensor(dataset['actions'][idx], device=device).float()
    # one-step update
    avg_loss += actor.update(act, obs)["loss"]
    # logging
    if (t + 1) % 1000 == 0:
        print(f'[t={t + 1}] {avg_loss / 1000}')
        avg_loss = 0.
# model saving
savepath = "tutorials/results/1_Minimal_DBC/"
if not os.path.exists(savepath):
    os.makedirs(savepath)
actor.save(savepath + "diffusion.pt")
# -------------- Inference -----------------
savepath = "tutorials/results/1_Minimal_DBC/"
actor.load(savepath + "diffusion.pt")
actor.eval()
# concurrently evaluate 50 environments
env_eval = gym.vector.make("kitchen-complete-v0", num_envs=50)
obs, cum_done, cum_rew = env_eval.reset(), 0., 0.
"""
`prior` is used to provide prior information, such as known parts of the generated data.
In this context, it is evident that we do not have any prior information to provide.
"""
prior = torch.zeros((50, act_dim), device=device)
for t in range(280):
    # sample with DDPM and 5 sampling steps
    act, log = actor.sample(
        prior, solver="ddpm", n_samples=50, sample_steps=5,
        temperature=0.5, w_cfg=1.0,
        condition_cfg=torch.tensor(obs, device=device, dtype=torch.float32))
    act = act.cpu().numpy()
    obs, rew, done, info = env_eval.step(act)
    cum_done = np.logical_or(cum_done, done)
    cum_rew += rew
    print(f'[t={t}] cum_rew: {cum_rew}')
    if cum_done.all():
        break
# kitchen-complete-v0 contains 4 subtasks, each contributing a reward of 1 when completed,
# so the return is clipped to [0, 4] and scaled by 25 to report a 0-100 score.
print(f'Mean score: {np.clip(cum_rew, 0., 4.).mean() * 25.}')
env_eval.close()