Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure log entries use consistent ordering and types for columns #1404

Merged
merged 36 commits into from
Jul 24, 2024

Conversation

matt-graham
Copy link
Collaborator

As mentioned in #1227 (comment) currently there are instances of structured log entries where the columns value computed for the header entry in the log from the first log instance for that key, is not consistent with the value for columns that would be computed from subsequent log instances for the key, due to for example the columns dicts have keys (column names) in different orders, different values (types for a given key) or completely non-overlapping key-value pairs.

This PR makes a series of related changes

  • The columns dict values in the header log entry are now stored in the logger object and compared to corresponding dict computed for subsequent log entries with same key and if these do not match a new exception type, InconsistentLoggedColumnsError is raised, with details of the differences.
  • When converting logging data to a dictionary, if the data is a dictionary or dataframe, which is converted to a dictionary, the resulting dictionary is sorted by its keys (allowing for them to be a mix of string and numeric values), to ensure consistent ordering of the columns even if the dictionaries / dataframes themselves do not necessarily order the keys / column consistently. Similarly set data is sorted before converting to a dictionary.
  • A helper function is added to tlo.logging.helpers for logging the properties of an individual in the population dataframe as a dictionary in a way that ensures stability of the types of the property values. In particular this is achieved by returning the NumPy / pandas extension scalar types associated with the datatype of the array underlying a particular column / property, except for the case of nullable booleans and categoricals for which there is no corresponding scalar type, in which case a length-1 array is instead used. The JSON encoding rules for these types is also updated accordingly.
  • For other cases where log entries were found to have inconsistent columns not covered by fixes above, manual fixes to ensure type stability were added (typically wrapping values in a call to a built-in Python type such as float or str) and in some cases to ensure keys were aligned across different logger calls.

Hopefully with fixes here we should be able to achieve consistent logging when suspending and resuming simulation using changes in #1227 compared to running contiguously.

@matt-graham matt-graham marked this pull request as draft June 21, 2024 17:35
@matt-graham
Copy link
Collaborator Author

In addition to the changes described above, this PR now also includes some more general changes / clean-up to the logging modules

  • The functionality for old style logging messages has been removed from the Logger class as this no longer seems to be used anywhere.
  • The functionality for allowing logging of multirow dataframes has also been removed as this was only used in tests and never publicly exposed.
  • The repetition across the various level specific logging methods has been removed by using partialmethod
  • The interface to the Logger class has been cleaned up, with all attributes that do not appear to be used outside of the class hinted as private by underscore prefixing, attributes which are used externally generally exposed as read only properties and some methods which did not require access to the class instance factored out to to functions.
  • The direct coupling between the Logger and Simulation classes has been removed by instead having a global 'private' function _get_simulation_date which can be set when initialising logging to a function which retrieves current date from simulation object. This simplifies unit testing the logging system and keeps a cleaner separation between the different framework components. A new reset function is also added to allow restoring the global state in the logging module (simulation date getter function and dictionary of initialised loggers) to original values.
  • Some functions which were previously defined in the logging helpers module have been moved to the core module to avoid relying on importing internal (underscore prefixed) names.
  • Type hints have been added to all functions and methods in src/tlo/logging/core.py and src/tlo/logging/helpers.py.
  • The unit tests in tests/test_logging.py have been rewritten from scratch to give better coverage and also test the various factored out utility functions.

@matt-graham
Copy link
Collaborator Author

A remaining question is whether we want the current InconsistentLoggedColumnsError exception to instead be a warning. At the moment a simulation will fail if an attempt is made to create a log record for a key with a different structure than when the first record was used to create the header. As these failures can occur quite far in to the simulation for inrequently logged keys / log records triggered by rare events, this can lead to annoying late failures, so a warning may be a better option as an inconsitent log entry structure is not a 'fatal' error.

@matt-graham matt-graham marked this pull request as ready for review July 2, 2024 14:28
@matt-graham matt-graham requested a review from tamuri July 2, 2024 14:28
@tamuri
Copy link
Collaborator

tamuri commented Jul 8, 2024

That sounds like a good idea, at least to start. We can monitor the warnings on the nightly scale run.

@matt-graham
Copy link
Collaborator Author

That sounds like a good idea, at least to start. We can monitor the warnings on the nightly scale run.

Now changed the exception to a warning.

Copy link
Collaborator

@tamuri tamuri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good - one minor suggestion (not a big deal).

src/tlo/logging/core.py Outdated Show resolved Hide resolved
@matt-graham matt-graham merged commit 625b4d9 into master Jul 24, 2024
59 checks passed
@matt-graham matt-graham deleted the mmg/inconsistent-log-columns-error-and-fixes branch July 24, 2024 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants