Standard DataFrames¶
The HistoryMatching library uses several standard Pandas DataFrames for its calculations. To use the library, the data you pass to it must conform to the DataFrame layouts described here.
Observations¶
HistoryMatching matches a model’s outputs to observations. There are two types of observations: those tied to specific time points and those that represent some aggregation of time points. Each has a specific DataFrame associated with it.
ObservationsFrame¶
An ObservationsFrame contains all observations tied to specific time points. Each row must have a unique observation_id, which identifies the time (e.g. 5 seconds) at which a particular observation (e.g. mosquito_count) was made. The observation itself must have a value (e.g. 300) and an uncertainty, expressed as a standard deviation in the stdev column (e.g. 5). Exact values have an uncertainty of 0, and time must be monotonically increasing.
Each observation must occur only once at each time point. In other words, (time, observation) is a unique key.
In the future, a special time value will denote a summary value; one that is calculated outside of time. An example might be the total number of individuals infected throughout the simulation.
Both your actual observations and your model results must be in ObservationsFrames.
An example table is shown below.
observation_id observation time value stdev
0 mosquito_count 3 3000 30
1 people_infected 3 10 0
2 mosquito_count 15 1000 10
3 people_infected 15 100 2
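As a minimal sketch, the example table can be built and checked with plain pandas. The helper check_observations_frame is hypothetical and not part of the HistoryMatching API; it only encodes the constraints stated above, and it reads "monotonically increasing" as "rows are ordered by non-decreasing time", which is an assumption.

```python
import pandas as pd

# The example table from above, as an ordinary pandas DataFrame.
observations = pd.DataFrame(
    {
        "observation_id": [0, 1, 2, 3],
        "observation": [
            "mosquito_count",
            "people_infected",
            "mosquito_count",
            "people_infected",
        ],
        "time": [3, 3, 15, 15],
        "value": [3000, 10, 1000, 100],
        "stdev": [30, 0, 10, 2],
    }
)


def check_observations_frame(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if df breaks a stated constraint."""
    required = {"observation_id", "observation", "time", "value", "stdev"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not df["observation_id"].is_unique:
        raise ValueError("observation_id values must be unique")
    if df.duplicated(subset=["time", "observation"]).any():
        raise ValueError("(time, observation) must be a unique key")
    if (df["stdev"] < 0).any():
        raise ValueError("stdev must be non-negative")
    # Assumption: rows are expected in non-decreasing time order.
    if not df["time"].is_monotonic_increasing:
        raise ValueError("time must be monotonically increasing")


check_observations_frame(observations)  # passes silently for the example table
```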
The data is stored in the so-called “tidy format”. This data format may, at first glance, seem unwieldy. Another “wide” format may seem preferable, e.g.:
observation_id time mosquito_count mosquito_count_stdev people_infected people_infected_stdev
0 3 3000 30 10 0
1 15 1000 10 100 2
However, experience has shown that this format can create significant difficulties in data processing. For instance, is mosquito_count_stdev an observation, or the uncertainty in the mosquito_count observation? It is better that you, the user, perform this conversion correctly than hope that we are able to guess your intentions!
Hadley Wickham writes in detail about the benefits of the tidy format in his paper “Tidy Data”.
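If your data starts out in a wide layout, one way to perform the conversion yourself is sketched below with plain pandas. The sketch uses the observation names from the tidy example table and assumes each uncertainty column is named after its observation with a _stdev suffix. The list observation_names is where you state your intentions explicitly, so nothing has to be guessed.

```python
import pandas as pd

# A wide table: one value column and one "<name>_stdev" column per observation.
wide = pd.DataFrame(
    {
        "time": [3, 15],
        "mosquito_count": [3000, 1000],
        "mosquito_count_stdev": [30, 10],
        "people_infected": [10, 100],
        "people_infected_stdev": [0, 2],
    }
)

# The user says which columns are observations; the rest are uncertainties.
observation_names = ["mosquito_count", "people_infected"]

# Build one small tidy frame per observation, then stack them.
pieces = []
for name in observation_names:
    pieces.append(
        pd.DataFrame(
            {
                "observation": name,
                "time": wide["time"],
                "value": wide[name],
                "stdev": wide[f"{name}_stdev"],
            }
        )
    )

tidy = (
    pd.concat(pieces, ignore_index=True)
    .sort_values(["time", "observation"], kind="stable")
    .reset_index(drop=True)
)

# Give every row its own identifier, as an ObservationsFrame requires.
tidy.insert(0, "observation_id", list(range(len(tidy))))
```

With this input, tidy has the same rows and column order as the tidy example table shown earlier.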
ParameterInfoFrame¶
This DataFrame is used to specify the names of the parameters used by a WrappedModel, as well as their minimum and maximum values. For a model with parameters beta and gamma, the frame has the following form:
name min max
beta 0.000001 0.01
gamma 0.000001 0.50
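A frame of this form can be written directly in pandas, as in the sketch below. Keeping the parameter name in an ordinary name column (rather than in the index) is an assumption based on the table layout, and the sanity checks are illustrative rather than something the library is known to perform.

```python
import pandas as pd

# Bounds copied from the table above. Storing the parameter name in an
# ordinary "name" column (not the index) is an assumption of this sketch.
parameter_info = pd.DataFrame(
    {
        "name": ["beta", "gamma"],
        "min": [0.000001, 0.000001],
        "max": [0.01, 0.50],
    }
)

# Illustrative sanity checks: unique names, and each min below its max.
names_are_unique = parameter_info["name"].is_unique
bounds_are_ordered = bool((parameter_info["min"] < parameter_info["max"]).all())
if not (names_are_unique and bounds_are_ordered):
    raise ValueError("duplicate parameter names or inverted bounds")
```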
ParameterSamplesFrame¶
HistoryMatching explores a parameter space by sampling it. Samples to be explored are stored in a ParameterSamplesFrame. For a model with parameters beta and gamma, the frame would look as follows:
param_id beta gamma
0 0.004407 0.316147
1 0.005409 0.433025
2 0.003196 0.123237
3 0.006439 0.280810
4 0.007980 0.050390
... ... ...
95 0.008666 0.483285
96 0.006346 0.264908
97 0.001813 0.036054
98 0.000379 0.229818
99 0.000878 0.116639
Note that all values in the param_id column are unique.
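To illustrate the shape of such a frame, the sketch below draws 100 samples uniformly between the bounds of a ParameterInfoFrame-style table. Independent uniform sampling with NumPy is an assumption made for illustration; the library may use a different sampling scheme.

```python
import numpy as np
import pandas as pd

# Bounds in the ParameterInfoFrame layout (name, min, max).
parameter_info = pd.DataFrame(
    {
        "name": ["beta", "gamma"],
        "min": [0.000001, 0.000001],
        "max": [0.01, 0.50],
    }
)

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
n_samples = 100

# One column per parameter, each drawn uniformly within its own bounds.
columns = {}
for name, low, high in zip(
    parameter_info["name"], parameter_info["min"], parameter_info["max"]
):
    columns[name] = rng.uniform(low, high, size=n_samples)

parameter_samples = pd.DataFrame(columns)

# param_id identifies each sample and must be unique.
parameter_samples.insert(0, "param_id", list(range(n_samples)))
```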
SimFrame¶
A SimFrame is like an ObservationsFrame except that it also includes replicate and param_id columns, like so:
time observation value stdev observation_id replicate param_id
0.000000 susceptible 190.000000 0 0 0 0.0
0.000000 prevalence 5.000000 0 2376 0 0.0
0.000000 infected 10.000000 0 396 0 0.0
0.000000 per_susceptible 95.000000 0 1188 0 0.0
0.000000 per_infected 5.000000 0 1584 0 0.0
... ... ... ... ... ... ...
100.795317 recovered 191.000000 0 1187 0 0.0
100.795317 per_infected 10.747664 0 1979 0 0.0
100.795317 infected 23.000000 0 791 0 0.0
100.795317 per_susceptible 0.000000 0 1583 0 0.0
100.795317 prevalence 10.747664 0 2771 0 0.0
run_replicates() can be used to generate a series of such SimFrames.
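To make the layout concrete without relying on run_replicates(), whose signature is not shown here, the sketch below assembles a SimFrame-shaped table by hand. The function toy_model_output is a hypothetical stand-in for a single model run, and its values are made up. The point is only how replicate and param_id are attached to ObservationsFrame-style output.

```python
import pandas as pd


def toy_model_output(beta: float) -> pd.DataFrame:
    """Hypothetical stand-in for one model run; the values are made up."""
    return pd.DataFrame(
        {
            "time": [0.0, 0.0, 1.0, 1.0],
            "observation": ["susceptible", "infected", "susceptible", "infected"],
            "value": [190.0, 10.0, 190.0 - 100 * beta, 10.0 + 100 * beta],
            "stdev": [0, 0, 0, 0],
            "observation_id": [0, 1, 2, 3],
        }
    )


# Two parameter samples in the ParameterSamplesFrame layout.
parameter_samples = pd.DataFrame({"param_id": [0, 1], "beta": [0.004, 0.005]})
n_replicates = 2

# Tag each run with the replicate number and the param_id it was run for.
runs = []
for sample in parameter_samples.itertuples(index=False):
    for replicate in range(n_replicates):
        run = toy_model_output(sample.beta).assign(
            replicate=replicate, param_id=sample.param_id
        )
        runs.append(run)

sim_frame = pd.concat(runs, ignore_index=True)
```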
MatchedFrame¶
HistoryMatching requires matching simulated observations to actual observations in order for the emulator to learn to approximate the model. This process is fairly standard and so has been encapsulated in the match_sim_outputs_to_observations() function. For time observations, the actual observation is identified as the one occurring closest in time to the modeled observation.
The MatchedFrame contains an observation_id_a column, which holds the observation_id of the actual observation as listed in an ObservationsFrame. Each actual observation is paired with a value and its stdev as produced by the simulation. The simulation might have been run multiple times, as indicated by the replicate column. The simulation may also have been run for different parameters, as indicated by the param_id column, which refers to the column of the same name in a ParameterSamplesFrame.
observation_id_a value stdev replicate param_id
0 2.5 0 0 0.0
1 3.5 0 0 0.0
0 4.5 0 1 0.0
1 8.0 0 1 0.0
0 6.0 0 0 1.0
... ... ... ... ...
1 5.0 0 1 48.0
0 9.5 0 0 49.0
1 1.5 0 0 49.0
0 3.5 0 1 49.0
1 0.0 0 1 49.0
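The nearest-in-time rule can be sketched with pandas.merge_asof, as below. This is not the implementation of match_sim_outputs_to_observations(), only an illustration of the idea on made-up data. When several simulated rows from the same run land on the same actual observation, the sketch keeps the one with the smallest time gap; that tie-breaking choice is an assumption.

```python
import pandas as pd

# Actual observations (ObservationsFrame layout); the values are made up.
observations = pd.DataFrame(
    {
        "observation_id": [0, 1],
        "observation": ["infected", "infected"],
        "time": [3.0, 15.0],
        "value": [12.0, 40.0],
        "stdev": [1.0, 2.0],
    }
)

# Simulated observations (SimFrame layout, trimmed to the columns used here).
sim_frame = pd.DataFrame(
    {
        "time": [2.6, 9.8, 14.1, 3.2, 15.5],
        "observation": ["infected"] * 5,
        "value": [11.0, 25.0, 38.0, 13.0, 41.0],
        "stdev": [0, 0, 0, 0, 0],
        "replicate": [0, 0, 0, 1, 1],
        "param_id": [0, 0, 0, 0, 0],
    }
)

# Keep only what is needed to identify an actual observation in time.
actual = observations.rename(
    columns={"observation_id": "observation_id_a", "time": "time_a"}
)[["observation_id_a", "observation", "time_a"]].sort_values("time_a")

# For each simulated row, attach the actual observation closest in time.
paired = pd.merge_asof(
    sim_frame.sort_values("time"),
    actual,
    left_on="time",
    right_on="time_a",
    by="observation",
    direction="nearest",
)

# Assumption: per actual observation and run, keep the closest simulated row.
paired["gap"] = (paired["time"] - paired["time_a"]).abs()
keys = ["observation_id_a", "replicate", "param_id"]
closest = paired.loc[paired.groupby(keys)["gap"].idxmin()]

matched = (
    closest[["observation_id_a", "value", "stdev", "replicate", "param_id"]]
    .sort_values(["param_id", "replicate", "observation_id_a"])
    .reset_index(drop=True)
)
```

The result has the MatchedFrame columns, with one row per combination of observation_id_a, replicate, and param_id.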