Organizing parameters – dataclasses and dataconf

The goal

In data science we have a lot of parameters in our simulations, and often they are scattered across the whole source code. The combination of dataclass and dataconf makes it very easy to move these parameters into a config file. dataconf supports many different config formats, among them JSON and YAML.


Install dataconf via pip:

pip install dataconf

Defining the shape of the config

First we define the shape of our configuration data with dataclasses. We need one dataclass Config (or whatever you want to call it) which forms the root of the config.

The root contains variables but can also contain sub-dataclasses, which in turn can contain variables and further nested dataclasses, and so on.

Here is an example:

from dataclasses import dataclass

@dataclass
class Date:
    day: int
    month: int
    year: int

    hour: int
    minute: int
    second: int


@dataclass
class MetaData:
    channels: int
    scale: float
    name: str


@dataclass
class Config:
    id: int
    date: Date
    meta_data: MetaData
    events: list[int]

If you are not sure that all parameters will be present in the config file, you can set default values (or a default_factory) with the field function in the dataclass. See the dataclass documentation for details.

Here is an additional example of how this could look:

from dataclasses import dataclass, field


@dataclass
class LearningParameters:
    """Parameters required for training"""

    learning_active: bool = field(default=True)

    loss_mode: int = field(default=0)
    loss_coeffs_mse: float = field(default=0.5)
    loss_coeffs_kldiv: float = field(default=1.0)

    optimizer_name: str = field(default="Adam")

    learning_rate_gamma_w: float = field(default=-1.0)
    learning_rate_threshold_w: float = field(default=0.00001)

    lr_schedule_name: str = field(default="ReduceLROnPlateau")
    lr_scheduler_use_performance: bool = field(default=False)
    lr_scheduler_factor_w: float = field(default=0.75)
    lr_scheduler_patience_w: int = field(default=-1)
    lr_scheduler_tau_w: int = field(default=10)

    number_of_batches_for_one_update: int = field(default=1)
    overload_path: str = field(default="Previous")

    weight_noise_range: list[float] = field(default_factory=list)
    eps_xy_initial: float = field(default=1.0)

    disable_scale_grade: bool = field(default=False)
    keep_last_grad_scale: bool = field(default=True)

    sbs_skip_gradient_calculation: list[bool] = field(default_factory=list)

    adapt_learning_rate_after_minibatch: bool = field(default=True)

    w_trainable: list[bool] = field(default_factory=list)
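
To see the defaults in action, here is a minimal, untested sketch (the file name learning.json and its content are hypothetical): if the config file only sets some of the parameters, the remaining ones keep their default values:

import dataconf

# learning.json (hypothetical) contains only: { "loss_mode": 2 }
cfg = dataconf.file("learning.json", LearningParameters)
print(cfg.loss_mode)       # 2 -> taken from the file
print(cfg.optimizer_name)  # "Adam" -> default value from the dataclass
print(cfg.w_trainable)     # [] -> default_factory produced an empty list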

The JSON config file

Here is an example JSON file tailored to our dataclass Config:

{
    "id": 1,
    "events": [
        1,
        2,
        3,
        4,
        5,
        6,
        7
    ],
    "date": {
        "day": 22,
        "month": 7,
        "year": 1983,
        "hour": 1,
        "minute": 37,
        "second": 12
    },
    "meta_data": {
        "channels": 5555,
        "scale": 0.00001,
        "name": "Experiment IV"
    }
}

Loading the config file

Now we can load the config file into memory:

import dataconf

cfg = dataconf.file("config.json", Config)
print(cfg)
print("")
print(cfg.id)
print("")
print(cfg.meta_data.name)

Output:

Config(id=1, date=Date(day=22, month=7, year=1983, hour=1, minute=37, second=12), meta_data=MetaData(channels=5555, scale=1e-05, name='Experiment IV'), events=[1, 2, 3, 4, 5, 6, 7])

1

Experiment IV

You can also split the configuration across several files via:

cfg = dataconf.multi.file("network.json").file("dataset.json").file("def.json").on(Config)
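
For example, the config.json from above could be split into two hypothetical files, part_a.json and part_b.json, which together provide all fields of Config:

import dataconf

# Hypothetical split of the config.json shown above:
#   part_a.json: { "id": 1, "events": [1, 2, 3, 4, 5, 6, 7] }
#   part_b.json: the "date" and "meta_data" blocks
# The files are merged before being parsed into Config:
cfg = dataconf.multi.file("part_a.json").file("part_b.json").on(Config)
print(cfg.id)         # 1 -> from part_a.json
print(cfg.date.year)  # 1983 -> from part_b.json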

I have not tested the other features of dataconf myself. According to its documentation, it can also read config information from different sources:

conf = dataconf.string('{ name: Test }', Config)
conf = dataconf.env('PREFIX_', Config)
conf = dataconf.dict({'name': 'Test'}, Config)
conf = dataconf.url('https://raw.githubusercontent.com/zifeo/dataconf/master/confs/test.hocon', Config)
conf = dataconf.file('confs/test.{hocon,json,yaml,properties}', Config)
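
As an illustration of the string source, here is a hedged sketch using the Date dataclass from above (HOCON allows unquoted keys):

import dataconf

conf = dataconf.string(
    "{ day: 22, month: 7, year: 1983, hour: 1, minute: 37, second: 12 }", Date
)
print(conf.year)  # 1983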

dataconf can also write the config from memory back to a file (with dump) or to a string (with dumps):

dataconf.dump('confs/test.hocon', conf, out='hocon')
dataconf.dump('confs/test.json', conf, out='json')
dataconf.dump('confs/test.yaml', conf, out='yaml')
dataconf.dump('confs/test.properties', conf, out='properties')
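
A minimal round trip could then look like this (untested; the output file name is just for illustration):

import dataconf

cfg = dataconf.file("config.json", Config)  # load the JSON config from above
cfg.meta_data.scale = 0.5                   # modify a parameter in memory
dataconf.dump("config_new.yaml", cfg, out="yaml")  # write it back as YAML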

Alternative: Hydra

If you need more features, you may want to look into Hydra:

  • Hierarchical configuration composable from multiple sources
  • Configuration can be specified or overridden from the command line
  • Dynamic command line tab completion
  • Run your application locally or launch it to run remotely
  • Run multiple jobs with different arguments with a single command

“Hydra is an open-source Python framework that simplifies the development of research and other complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs - much like a Hydra with multiple heads.”

Hydra is open source; its code can be found on GitHub.
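
For a first impression, here is a minimal hedged sketch of a Hydra application. It assumes a hypothetical conf/config.yaml with the same structure as our JSON file; this is not part of the dataconf example above:

import hydra
from omegaconf import DictConfig

# Hydra composes conf/config.yaml (hypothetical) into a DictConfig
# and allows overrides from the command line, e.g.:
#   python main.py meta_data.name="Experiment V"
@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(cfg.meta_data.name)

if __name__ == "__main__":
    main()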