Organizing parameters – dataclasses and dataconf
The goal
Data science simulations tend to have many parameters, and these are often scattered across the whole source code. Combining dataclass with dataconf makes it easy to collect the parameters in a config file. dataconf supports several config formats, among them JSON and YAML.
Questions to David Rotermund
pip install dataconf
Defining the shape of the config
First we define the shape of the configuration data with dataclasses. We need one dataclass, here called Config (the name is up to you), that forms the root of the config.
The root contains variables, but it can also contain sub-dataclasses, which in turn can contain variables and further sub-dataclasses, and so on.
Here is an example:
from dataclasses import dataclass


@dataclass
class Date:
    day: int
    month: int
    year: int
    hour: int
    minute: int
    second: int


@dataclass
class MetaData:
    channels: int
    scale: float
    name: str


@dataclass
class Config:
    id: int
    date: Date
    meta_data: MetaData
    events: list[int]
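Because these are ordinary dataclasses, a configuration can also be built directly in Python, for example in a unit test or a quick experiment without a config file. The following is a minimal sketch that repeats the class definitions from above so that it runs on its own. Keep in mind that plain dataclasses do not check the type hints at runtime.

```python
from dataclasses import dataclass


@dataclass
class Date:
    day: int
    month: int
    year: int
    hour: int
    minute: int
    second: int


@dataclass
class MetaData:
    channels: int
    scale: float
    name: str


@dataclass
class Config:
    id: int
    date: Date
    meta_data: MetaData
    events: list[int]


# Build the nested configuration by hand, from the inside out.
cfg = Config(
    id=1,
    date=Date(day=22, month=7, year=1983, hour=1, minute=37, second=12),
    meta_data=MetaData(channels=5555, scale=0.00001, name="Experiment IV"),
    events=[1, 2, 3],
)

# Nested values are reached through plain attribute access.
print(cfg.date.year)
print(cfg.meta_data.name)
```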
If you are not sure that every parameter is present in the config file, you can give the fields a default value (default) or a default_factory via field() in the dataclass. See the dataclass documentation for details.
Here is a further example of how this could look:
from dataclasses import dataclass, field


@dataclass
class LearningParameters:
    """Parameters required for training"""

    learning_active: bool = field(default=True)

    loss_mode: int = field(default=0)
    loss_coeffs_mse: float = field(default=0.5)
    loss_coeffs_kldiv: float = field(default=1.0)

    optimizer_name: str = field(default="Adam")
    learning_rate_gamma_w: float = field(default=-1.0)
    learning_rate_threshold_w: float = field(default=0.00001)

    lr_schedule_name: str = field(default="ReduceLROnPlateau")
    lr_scheduler_use_performance: bool = field(default=False)
    lr_scheduler_factor_w: float = field(default=0.75)
    lr_scheduler_patience_w: int = field(default=-1)
    lr_scheduler_tau_w: int = field(default=10)

    number_of_batches_for_one_update: int = field(default=1)
    overload_path: str = field(default="Previous")

    # Mutable defaults such as lists must be created via default_factory.
    weight_noise_range: list[float] = field(default_factory=list)
    eps_xy_intitial: float = field(default=1.0)

    disable_scale_grade: bool = field(default=False)
    kepp_last_grad_scale: bool = field(default=True)

    sbs_skip_gradient_calculation: list[bool] = field(default_factory=list)

    adapt_learning_rate_after_minibatch: bool = field(default=True)

    w_trainable: list[bool] = field(default_factory=list)
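The difference between default and default_factory matters for list-valued parameters. The sketch below uses a hypothetical, shortened class named SmallParameters (not part of the original code) to show that default_factory gives every instance its own fresh list, while scalar defaults can simply be overridden per instance.

```python
from dataclasses import dataclass, field


@dataclass
class SmallParameters:
    """Hypothetical, shortened stand-in for LearningParameters."""

    learning_rate: float = field(default=0.001)
    # A mutable default needs default_factory; writing `= []` here would
    # raise a ValueError when the class is defined.
    weight_noise_range: list[float] = field(default_factory=list)


a = SmallParameters()
b = SmallParameters(learning_rate=0.01)

# Changing the list of one instance does not affect the other one.
a.weight_noise_range.append(0.5)

print(a)  # SmallParameters(learning_rate=0.001, weight_noise_range=[0.5])
print(b)  # SmallParameters(learning_rate=0.01, weight_noise_range=[])
```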
The JSON config file
Here is an example JSON file tailored to our dataclass Config:
{
    "id": 1,
    "events": [
        1,
        2,
        3,
        4,
        5,
        6,
        7
    ],
    "date": {
        "day": 22,
        "month": 7,
        "year": 1983,
        "hour": 1,
        "minute": 37,
        "second": 12
    },
    "meta_data": {
        "channels": 5555,
        "scale": 0.00001,
        "name": "Experiment IV"
    }
}
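One way to get a first config file with the right keys is to build the dataclass in Python and serialize it with the standard library. The following sketch uses only dataclasses.asdict and json (not dataconf) together with a shortened, hypothetical class named SmallConfig.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MetaData:
    channels: int
    scale: float
    name: str


@dataclass
class SmallConfig:
    """Hypothetical, shortened version of Config."""

    id: int
    meta_data: MetaData
    events: list[int]


cfg = SmallConfig(
    id=1,
    meta_data=MetaData(channels=5555, scale=0.00001, name="Experiment IV"),
    events=[1, 2, 3],
)

# asdict() converts nested dataclasses into nested dictionaries,
# which json.dumps() can then turn into JSON text.
json_text = json.dumps(asdict(cfg), indent=4)
print(json_text)
```

The printed text can be saved as a .json file and edited by hand afterwards.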
Loading the config file
Now we can load the config file into memory:
import dataconf
cfg = dataconf.file("config.json", Config)
print(cfg)
print("")
print(cfg.id)
print("")
print(cfg.meta_data.name)
Output:
Config(id=1, date=Date(day=22, month=7, year=1983, hour=1, minute=37, second=12), meta_data=MetaData(channels=5555, scale=1e-05, name='Experiment IV'), events=[1, 2, 3, 4, 5, 6, 7])

1

Experiment IV
You can also split the configuration across several files and load them together:
cfg = dataconf.multi.file("network.json").file("dataset.json").file("def.json").on(Config)
I have not tested the other features of dataconf.
According to its documentation, it can also read config information from other sources:
conf = dataconf.string('{ name: Test }', Config)
conf = dataconf.env('PREFIX_', Config)
conf = dataconf.dict({'name': 'Test'}, Config)
conf = dataconf.url('https://raw.githubusercontent.com/zifeo/dataconf/master/confs/test.hocon', Config)
conf = dataconf.file('confs/test.{hocon,json,yaml,properties}', Config)
It can also write the config from memory to a file (with dump) or to a string (with dumps):
dataconf.dump('confs/test.hocon', conf, out='hocon')
dataconf.dump('confs/test.json', conf, out='json')
dataconf.dump('confs/test.yaml', conf, out='yaml')
dataconf.dump('confs/test.properties', conf, out='properties')
Alternative: Hydra
If you need more features, you may want to look into Hydra:
- Hierarchical configuration composable from multiple sources
- Configuration can be specified or overridden from the command line
- Dynamic command line tab completion
- Run your application locally or launch it to run remotely
- Run multiple jobs with different arguments with a single command
“Hydra is an open-source Python framework that simplifies the development of research and other complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs - much like a Hydra with multiple heads.”
Hydra is open source; its source code can be found on GitHub.