Parsing an execution scenario

In this section, we assume an execution scenario has already been created and executed. For reference on how to do that, check the Execution Scenario section.

optilog.running.parse_scenario(path, parsing_info, trim_path='common', add_logs=False, simplify_index=False, time_scale='s')
Parameters
  • path (str) – Path to execution scenario

  • parsing_info (Union[ParsingInfo, Dict[str, ParsingInfo]]) – ParsingInfo object used to parse the logs. For a multi-solver scenario, a dictionary is accepted with solver names as keys and their corresponding ParsingInfo objects as values

  • trim_path (Union[bool, str]) – How to trim the path to the instance. Can be False (don't trim), “common” (trim the common path of the files) or “name” (keep only the name of the files)

  • add_logs (Union[bool, str]) – Add a logs_path column with the path to the output of the tasks. Can be False (don't add the column), “trimmed” (add the trimmed path to the logs) or “full” (add the full path to the logs)

  • simplify_index (bool) – Whether to simplify the row index when possible. If True and the tasks were executed with a single seed, the seed is added as a column instead of as an index level.

  • time_scale (str) – Scale to be used for the time values. Can be “s” (seconds) or “ms” (milliseconds)

Return type

DataFrame

Returns

DataFrame with the parsed information

To parse a scenario we use the parse_scenario method, which returns a DataFrame with the parsed data. parse_scenario parses the logs following the directives of a ParsingInfo object.

class optilog.running.ParsingInfo

Creates a blank ParsingInfo with no parsing filters.

static from_template(output_format=None, model_format=None)

Generates a ParsingInfo instance with the most common parsing filters for SAT/MaxSAT experiments.

The output_format parameter parses the Solution Status line (‘s SATISFIABLE’ etc.) into a filter called sat. Additionally, in maxsat mode the Solution Cost Line (‘o 13’) is also parsed into a filter called cost. The model_format parameter parses the Solution Values into a filter called model. The ‘standard’ format expects decimal digits (v 1 -2 3 -4), while the ‘binary’ format expects a binary output (v 1010).

All the filters are added with a timestamp but no history. The cost filter is cast to integer and the model filter is parsed to a list of decimal integers.

Parameters
  • output_format (Optional[str]) – Accepts {‘sat’, ‘maxsat’, None} as options. Specifies the Solution Status and Solution Cost Line filters.

  • model_format (Optional[str]) – Accepts {‘standard’, ‘binary’, None} as options. Specifies the format of the model to be parsed. If it is None, models won’t be parsed, which may improve performance.

Return type

ParsingInfo

Returns

A ParsingInfo instance with configured filters
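As an illustration of the solver output lines that from_template is designed to recognize, the snippet below matches them with simplified regular expressions using Python's re module. These patterns are illustrative only; the exact expressions used internally by the library are not shown here.

```python
import re

# Illustrative (not the library's internal) patterns for the standard
# SAT/MaxSAT output lines that from_template targets.
status_re = re.compile(r"^s\s+(SATISFIABLE|UNSATISFIABLE|UNKNOWN)$")
cost_re = re.compile(r"^o\s+(\d+)$")             # Solution Cost Line
model_re = re.compile(r"^v\s+((?:-?\d+\s*)+)$")  # 'standard' model format

log = [
    "c this is a comment line",
    "o 13",
    "s SATISFIABLE",
    "v 1 -2 3 -4 0",
]

for line in log:
    if m := status_re.match(line):
        print("sat:", m.group(1))
    elif m := cost_re.match(line):
        print("cost:", int(m.group(1)))
    elif m := model_re.match(line):
        # The matched group is split into a list of decimal integers,
        # mirroring how the model filter is parsed.
        model = [int(x) for x in m.group(1).split()]
        print("model:", model)
```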

add_filter(name, expression, timestamp=False, save_history=False, cast_to=None)

Adds a filter to parse the logs. The name of the filter will be added as a column of the pandas DataFrame. The expression must have a matching group (surrounded by parentheses). If the timestamp parameter is enabled, a “wall_time_{name}” column (and, when requested, a “cpu_time_{name}” column) will be added to the DataFrame with the timestamp of the match, expressed in the scale selected by the time_scale parameter of parse_scenario. By default, only the last match of each filter is logged; if you want access to all the matches, enable save_history, which saves the history of all the parsed values as a list. If the cast_to parameter is assigned a numeric type, the matched value of the filter will be cast while parsing the logs.

Parameters
  • name (str) – Name of the filter tag

  • expression (str) – Matching regex with at least one group

  • timestamp (Union[bool, str]) – Whether to include the timestamp of the match. The possible values are False for no timestamp, True for all the timestamps, “cpu” for only the cpu timestamps, and “wall” for only the wall timestamps.

  • save_history (bool) – Whether or not to save the history of all the matches.

  • cast_to (Optional[Type[Union[int, float]]]) – Accepts {None, int, float}. Specifies whether the match should be cast
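Conceptually, a filter scans the whole log with its regex and keeps either the last match or, with save_history, every match. A minimal sketch of that behaviour with Python's re module (this is not the library's actual implementation):

```python
import re

log = """\
Restart 1
Num conflicts: 100
Restart 2
Num conflicts: 250
"""

# One matching group, as add_filter requires; cast_to=int corresponds
# to the int() conversion below.
pattern = re.compile(r"Num conflicts:\s*(\d+)")
matches = [int(m.group(1)) for m in pattern.finditer(log)]

# Default behaviour: only the last match is kept.
last = matches[-1]
# With save_history=True, all matches are kept as a list.
history = matches
print(last, history)
```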

Here we have a very basic example with a single SAT solver (glucose):

from optilog.running import ParsingInfo, parse_scenario

parsing_info = ParsingInfo.from_template('sat', 'standard')
# (Optional) Add extra parsing regex
parsing_info.add_filter('num_vars', r'Number of variables:\s*(\d+)', cast_to=int)
parsing_info.add_filter('num_clauses', r'Number of clauses:\s*(\d+)', cast_to=int)

df = parse_scenario(
    './example_glucose',
    parsing_info=parsing_info
)

And then analyze the resulting dataframe:

# 1. Print column from the parsing tag:
print(df['glucose41']['num_clauses'])

# 2. Print column from the parsing tag defined in the scenario itself:
print(df['glucose41']['num_vars'])


# 3. Available information:
print(df.columns)


# 4. Manual example:

# 4.0 Select SAT solver:
df = df['glucose41']

# 4.1 Select SAT instances:
print(df[df['sat'] == 'SATISFIABLE'])

# 4.2 Select UNSAT instances solved in less than 10s:
# NOTE: Careful with the parentheses (this is a pandas issue)
print(df[(df['sat'] == 'UNSATISFIABLE') & (df['wall_time_sat'] < 10)])

# 4.3 Select instances with more than 1M clauses and 250k variables

filtered = df[(df['num_vars'] > 250_000) & (df['num_clauses'] > 1_000_000)]
print(filtered)

# 4.4 Check if all the instances of the previous example are unsolved

print('Filtered are unsolved?', filtered['sat'].isnull().all())
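Step 4.4 above relies on unsolved instances appearing as missing values in the sat column. Assuming standard pandas semantics, the possible outcomes can be tallied on a mock column like this (the values are made up):

```python
import pandas as pd

# Mock 'sat' column: None marks an instance the solver did not answer.
sat = pd.Series(
    ["SATISFIABLE", "UNSATISFIABLE", None, "SATISFIABLE", None],
    name="sat",
)

# value_counts(dropna=False) also counts the unsolved (NaN) instances.
print(sat.value_counts(dropna=False))
print("solved:", sat.notnull().sum(), "unsolved:", sat.isnull().sum())
```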

ParsingInfo offers a lot of parsing flexibility. The following examples show some of what is possible with it:

# Parse the string "Number of clauses: 34" and cast the result to int
parsing_info.add_filter('num_clauses', r'Number of clauses:\s*(\d+)', cast_to=int)

# Parse the string "Suboptimum: 34.72" and cast the result to float.
# Keeps a history of all the parsed values and stores them in a list.
parsing_info.add_filter('subopt', r'Suboptimum:\s*(\d+\.?\d*)', cast_to=float, save_history=True)
# If both timestamp and save_history are True, the history will be a list of tuples (timestamp, value)
# If timestamp is both cpu and wall, the history will be a list of tuples ((cpu, wall), value)

# Parse the string "o 34" and cast the result to int.
# On parsing, will create two/three columns. A "cost" column with the last match,
# and a "wall_time_cost" (and optionally "cpu_time_cost") column with the timestamp
# in seconds when the string was printed.
parsing_info.add_filter("cost", r"^o\s+(\d+)", timestamp=True, cast_to=int)

# This is the same as the previous example, but with timestamps in milliseconds.
# The time scale is selected when parsing the scenario, not on the filter itself:
parsing_info.add_filter("cost", r"^o\s+(\d+)", timestamp=True, cast_to=int)
df = parse_scenario('./scenario', parsing_info=parsing_info, time_scale='ms')
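As a sketch of consuming such a history after parsing: assuming a filter was added with timestamp=True and save_history=True, its column would hold a list of (timestamp, value) tuples as described above. The numbers below are made up for illustration:

```python
# Hypothetical parsed history for one instance: (wall timestamp, cost).
cost_history = [(0.5, 120), (2.1, 87), (9.8, 34)]

# Best (lowest) cost found and the time at which it was reached.
best_time, best_cost = min(cost_history, key=lambda tv: tv[1])
print(f"best cost {best_cost} found at {best_time}s")

# Cost trajectory, e.g. for plotting an anytime curve.
times = [t for t, _ in cost_history]
costs = [c for _, c in cost_history]
```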

The following example shows how to parse the number of conflicts from an experiment executed on several instances with the Glucose solver. Notice that this scenario has been created with a single seed.

from optilog.running import ParsingInfo, parse_scenario

parsing_info = ParsingInfo()
parsing_info.add_filter('conflicts', r'Num conflicts: (\d+)')

df = parse_scenario(
    './scenario',
    parsing_info=parsing_info
)

We would get an output like this one:

>>> print(df)
                                                   glucose41
                                                   conflicts
instance                                       seed
manthey/traffic/traffic_pcb_unknown.cnf.gz     1       739612
manthey/traffic/traffic_b_unsat.cnf.gz         1       895011
manthey/traffic/traffic_3b_unknown.cnf.gz      1       876072
manthey/traffic/traffic_fb_unknown.cnf.gz      1       937239
manthey/traffic/traffic_kkb_unknown.cnf.gz     1       950170
(...)

Notice that both the rows and the columns use a MultiIndex. If we parsed with simplify_index=True we would get a single index on the rows:

>>> print(df)
                                           glucose41
                                                   seed conflicts
manthey/traffic/traffic_pcb_unknown.cnf.gz            1    739612
manthey/traffic/traffic_b_unsat.cnf.gz                1    895011
manthey/traffic/traffic_3b_unknown.cnf.gz             1    876072
manthey/traffic/traffic_fb_unknown.cnf.gz             1    937239
manthey/traffic/traffic_kkb_unknown.cnf.gz            1    950170

Now there are two columns, one for the seed and one for the number of conflicts.

A MultiIndex on the rows is useful when the experiment is run with multiple seeds. A MultiIndex on the columns is useful when the experiment is run with multiple solvers.

Here is an example of an experiment with multiple seeds and multiple solvers:

>>> print(df)
                                                         glucose   cadical
                                                       conflicts conflicts
instance                                           seed
manthey/traffic/traffic_pcb_unknown.cnf.gz         2        9932     13823
                                                   1        4743      8738
manthey/traffic/traffic_b_unsat.cnf.gz             2       21886     14970
                                                   1       10158     14999
manthey/traffic/traffic_3b_unknown.cnf.gz          1       10778     21547
                                                   2       22622     20365
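Row and column MultiIndexes like the ones above can be sliced with standard pandas operations. A self-contained sketch with made-up numbers (mimicking, not produced by, parse_scenario):

```python
import pandas as pd

# Mock DataFrame shaped like a multi-seed, multi-solver scenario.
rows = pd.MultiIndex.from_tuples(
    [("a.cnf.gz", 1), ("a.cnf.gz", 2), ("b.cnf.gz", 1), ("b.cnf.gz", 2)],
    names=["instance", "seed"],
)
cols = pd.MultiIndex.from_product([["glucose", "cadical"], ["conflicts"]])
df = pd.DataFrame(
    [[4743, 8738], [9932, 13823], [10158, 14999], [21886, 14970]],
    index=rows, columns=cols,
)

# Select one solver (drops the first column level).
glucose = df["glucose"]

# Select all rows for seed 1 with a cross-section.
seed1 = df.xs(1, level="seed")

# Average conflicts per instance across seeds.
mean_conflicts = df.groupby(level="instance").mean()
print(mean_conflicts)
```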

If we needed to parse the same attributes with different regular expressions (or we needed to parse different attributes), we could supply multiple ParsingInfo objects to the parse_scenario function:

df = parse_scenario(
    './scenario',
    parsing_info={
        'cadical': parsing_info_cadical,
        'glucose': parsing_info_glucose,
    }
)
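Per-solver ParsingInfo objects are typically needed because solvers report the same attribute in different formats. The log lines and regexes below are hypothetical, but they show why a single expression may not fit both solvers; each regex would be passed to add_filter on the corresponding ParsingInfo:

```python
import re

# Hypothetical log formats: each solver reports conflicts differently.
glucose_line = "c conflicts : 9932"
cadical_line = "c statistics: conflicts 13823"

# One regex per solver, e.g. for
# parsing_info_glucose.add_filter(...) and parsing_info_cadical.add_filter(...)
glucose_re = re.compile(r"conflicts\s*:\s*(\d+)")
cadical_re = re.compile(r"conflicts\s+(\d+)")

print(int(glucose_re.search(glucose_line).group(1)))  # 9932
print(int(cadical_re.search(cadical_line).group(1)))  # 13823

# Neither regex matches the other solver's format:
assert glucose_re.search(cadical_line) is None
assert cadical_re.search(glucose_line) is None
```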