TensorFlow Profiling

This article introduces the TensorFlow profiler for Python users in practice. First, I present the basic configuration of the profiler, then I use an example to show how to integrate it with existing code. Finally, I explain how to access the results from the profiler's output.

The TensorFlow profiler (profiler hereafter) is a tool for analyzing deep neural networks and tuning their performance.

In [1]:
import tensorflow as tf
import tabulate
from tensorflow.profiler import ProfileOptionBuilder as POB

Profile Option Builder

The profiler relies on a dictionary of parameters to control what is tracked and reported. The available options and their defaults are described below.

Functions to create default options

TensorFlow provides three methods to create a configuration for basic statistics:

  • tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()
  • tf.profiler.ProfileOptionBuilder.float_operation()
  • tf.profiler.ProfileOptionBuilder.time_and_memory()

Each of the above functions returns a dict containing the default values shown below:

In [2]:
opts = [POB.trainable_variables_parameter(), POB.float_operation(), POB.time_and_memory()]
compare_table = [(i, opts[0][i], opts[1][i], opts[2][i]) for i in opts[0].keys()]
print(tabulate.tabulate(compare_table, headers=["Name","trainable_variables_parameter", "float_operation", "time_and_memory"]))
Name                       trainable_variables_parameter    float_operation    time_and_memory
-------------------------  -------------------------------  -----------------  -------------------
max_depth                  10000                            10000              10000
min_bytes                  0                                0                  1
min_micros                 0                                0                  1
min_params                 0                                0                  0
min_float_ops              0                                1                  0
min_occurrence             0                                0                  0
order_by                   name                             float_ops          micros
account_type_regexes       ['_trainable_variables']         ['.*']             ['.*']
start_name_regexes         ['.*']                           ['.*']             ['.*']
trim_name_regexes          []                               []                 []
show_name_regexes          ['.*']                           ['.*']             ['.*']
hide_name_regexes          []                               []                 []
account_displayed_op_only  True                             True               True
select                     ['params']                       ['float_ops']      ['micros', 'bytes']
step                       -1                               -1                 -1
output                     stdout                           stdout             stdout

Customize options

The above detailed configurations can be further modified. There are two approaches:

  1. Directly change the values in the dict (not preferred; a minimal sketch follows this list);
  2. Create a ProfileOptionBuilder object, either without defaults or initialized from a dict of defaults, then update values (see the next section) and call build() to create the dict.
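
A minimal sketch of the first approach, editing the dict returned by one of the default functions in place (manual_opts is just an illustrative name):

manual_opts = POB.time_and_memory()
manual_opts['max_depth'] = 100     # same effect as with_max_depth(100) below
manual_opts['order_by'] = 'bytes'  # sort the report by requested bytes

The second approach is demonstrated in the cell below.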
In [3]:
config_dict = POB.time_and_memory()
new_config = POB(config_dict)
new_config.with_max_depth(100)
new_config.build()
Out[3]:
{'max_depth': 100,
 'min_bytes': 1,
 'min_peak_bytes': 0,
 'min_residual_bytes': 0,
 'min_output_bytes': 0,
 'min_micros': 1,
 'min_accelerator_micros': 0,
 'min_cpu_micros': 0,
 'min_params': 0,
 'min_float_ops': 0,
 'min_occurrence': 0,
 'order_by': 'micros',
 'account_type_regexes': ['.*'],
 'start_name_regexes': ['.*'],
 'trim_name_regexes': [],
 'show_name_regexes': ['.*'],
 'hide_name_regexes': [],
 'account_displayed_op_only': True,
 'select': ['micros', 'bytes'],
 'step': -1,
 'output': 'stdout'}

Profiler update functions and values

The comments below are taken from the profiler's detailed documentation in g3doc.

with_max_depth (key: max_depth)
    Show nodes that are at most this number of hops from the starting node in the data structure.

with_min_memory (keys: min_bytes, min_peak_bytes, min_residual_bytes, min_output_bytes)
    min_bytes: Show nodes that request at least this number of bytes.
    min_peak_bytes: Show nodes that use at least this number of bytes during peak memory usage.
    min_residual_bytes: Show nodes that have at least this number of bytes not being de-allocated after Compute.
    min_output_bytes: Show nodes that have at least this number of bytes of output (not necessarily allocated by the nodes).

with_min_execution_time (keys: min_micros, min_accelerator_micros, min_cpu_micros)
    min_micros: Show nodes that spend at least this number of microseconds to run. It sums accelerator_micros and cpu_micros. Note: CPU and accelerator can run in parallel.
    min_accelerator_micros: Show nodes that spend at least this number of microseconds to run on an accelerator (e.g. GPU).
    min_cpu_micros: Show nodes that spend at least this number of microseconds to run on CPU.

with_min_parameters (key: min_params)
    Show nodes that contain at least this number of parameters.

with_min_occurrence (key: min_occurrence)
    Show nodes that appear at least this number of times.

with_min_float_operations (key: min_float_ops)
    Show nodes that contain at least this number of float operations. Only available if a node has op.RegisterStatistics() defined and an OpLogProto is provided.

with_accounted_types (key: account_type_regexes)
    Account and display the nodes whose types match one of the specified type regexes. tfprof allows users to define extra operation types for graph nodes through the tensorflow.tfprof.OpLogProto proto. Regexes are comma-separated.

with_node_names (keys: start_name_regexes, trim_name_regexes, show_name_regexes, hide_name_regexes)
    start_name_regexes: Show nodes starting from the node that matches the regexes, recursively. Regexes are comma-separated.
    trim_name_regexes: Hide nodes starting from the node that matches the regexes, recursively. Regexes are comma-separated.
    show_name_regexes: Show nodes that match the regexes. Regexes are comma-separated.
    hide_name_regexes: Hide nodes that match the regexes. Regexes are comma-separated.

account_displayed_op_only (key: account_displayed_op_only)
    If True, only account the statistics of ops eventually displayed. If False, account all op statistics matching -account_type_regexes recursively.

with_empty_output, with_stdout_output, with_file_output, with_timeline_output, with_pprof_output (key: output)
    with_empty_output: Do not generate side-effect outputs.
    with_stdout_output: Print the result to stdout.
    with_file_output: Print the result to a file.
    with_timeline_output: Generate a timeline json file.
    with_pprof_output: Generate a pprof profile gzip file.

order_by (key: order_by)
    Order the results by [name|depth|bytes|peak_bytes|residual_bytes|output_bytes|micros|accelerator_micros|cpu_micros|params|float_ops|occurrence].

select (key: select)
    Comma-separated list of attributes to show. Supported attributes: [bytes|peak_bytes|residual_bytes|output_bytes|micros|accelerator_micros|cpu_micros|params|float_ops|occurrence|tensor_value|device|op_types|input_shapes].

with_step (key: step)
    Show the stats of this step when multiple steps of RunMetadata were added. By default, show the average of all steps.
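
Several of these methods can be chained, since each call returns the builder itself. A sketch only (the regex and file path below are placeholders, not part of the later demo):

opts = (POB(POB.time_and_memory())
        .with_max_depth(20)
        .with_min_execution_time(min_micros=10)
        .with_node_names(show_name_regexes=['.*MatMul.*'])
        .order_by('micros')
        .with_file_output('/tmp/profile.txt')
        .build())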

A simple demo

In [4]:
# Build a graph
tf.reset_default_graph()
x = [[1]*5 for i in range(2)] # 2x5 matrix
A = [[1]*2 for i in range(2)] # 2x2 matrix

tf_x = tf.constant(x)
tf_A = tf.constant(A)
tf_y = tf.matmul(tf_A, tf_x)
In [5]:
opts = POB.float_operation()
output = tf.profiler.profile(tf.get_default_graph(),
                        cmd='scope',
                        options=opts)
print("The total number of FLOPs is %d"%output.total_float_ops)
The total number of FLOPs is 40
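
Counting float operations requires no runtime information. To profile time and memory, the profiler also needs RunMetadata collected from an actual run. A sketch (not executed in this notebook) reusing tf_y from above:

run_meta = tf.RunMetadata()
with tf.Session() as sess:
    # trace one run of tf_y to collect per-op timing and memory statistics
    sess.run(tf_y,
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_meta)

tf.profiler.profile(tf.get_default_graph(),
                    run_meta=run_meta,
                    cmd='scope',
                    options=POB.time_and_memory())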
In [6]:
type(output)
Out[6]:
tensorflow.core.profiler.tfprof_output_pb2.GraphNodeProto

How to use the result

The output of the profiler is a Protocol Buffer; its content is defined by GraphNodeProto in tensorflow/core/profiler/tfprof_output.proto, as the type printed above suggests.

For people not familiar with Protocol Buffers (such as me): the values can be accessed directly by field name, or the message can be converted to a plain dict, for example with google.protobuf.json_format.MessageToDict.
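
For example, a sketch (json_format ships with the protobuf package that TensorFlow depends on):

from google.protobuf.json_format import MessageToDict
# keep the original snake_case field names instead of the default camelCase
output_dict = MessageToDict(output, preserving_proto_field_name=True)
print(sorted(output_dict.keys()))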

The basic keys are:

In [7]:
print("\n".join(output.DESCRIPTOR.fields_by_name.keys()))
name
tensor_value
run_count
exec_micros
accelerator_exec_micros
cpu_exec_micros
requested_bytes
peak_bytes
residual_bytes
output_bytes
parameters
float_ops
devices
total_definition_count
total_run_count
total_exec_micros
total_accelerator_exec_micros
total_cpu_exec_micros
total_requested_bytes
total_peak_bytes
total_residual_bytes
total_output_bytes
total_parameters
total_float_ops
shapes
input_shapes
children

And the values associated with those keys are:

In [8]:
compare_table = [(i, getattr(output,i)) for i in output.DESCRIPTOR.fields_by_name.keys()]
print(tabulate.tabulate(compare_table, headers=["Name","Value"]))
Name                           Value
-----------------------------  -------------------------
name                           _TFProfRoot
tensor_value
run_count                      0
exec_micros                    0
accelerator_exec_micros        0
cpu_exec_micros                0
requested_bytes                0
peak_bytes                     0
residual_bytes                 0
output_bytes                   0
parameters                     0
float_ops                      0
devices                        []
total_definition_count         2
total_run_count                0
total_exec_micros              0
total_accelerator_exec_micros  0
total_cpu_exec_micros          0
total_requested_bytes          0
total_peak_bytes               0
total_residual_bytes           0
total_output_bytes             0
total_parameters               0
total_float_ops                40
shapes                         []
input_shapes                   {}
children                       [name: "MatMul"
                               float_ops: 40
                               total_float_ops: 40
                               input_shapes {
                                 value {
                                   dim {
                                     size: 2
                                   }
                                   dim {
                                     size: 2
                                   }
                                 }
                               }
                               input_shapes {
                                 key: 1
                                 value {
                                   dim {
                                     size: 2
                                   }
                                   dim {
                                     size: 5
                                   }
                                 }
                               }
                               total_definition_count: 1
                               ]
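
Since children is itself a list of GraphNodeProto messages, the same fields can be read recursively. A minimal sketch that walks the profile tree and prints the float operation counts:

def print_float_ops(node, depth=0):
    # each node carries its own stats plus totals aggregated over its subtree
    print("%s%s: float_ops=%d, total_float_ops=%d"
          % ("  " * depth, node.name, node.float_ops, node.total_float_ops))
    for child in node.children:
        print_float_ops(child, depth + 1)

print_float_ops(output)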

Other notes

If you look carefully, you will find that the reported number of float operations seems a bit large. For two matrices in $R^{j\times k}$ and $R^{k\times l}$, it should be $(2k-1)jl$, whereas the profiler reports $2jkl$. This is because the profiler relies on the RegisterStatistics('flops') function registered with each operation, and the one for MatMul counts $2jkl$, which is exactly what we observed above. So be careful when the exact count matters to you; see the related discussion of the profiler's caveats regarding float operations.
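
A quick check for the demo above (A is 2x2 and x is 2x5, so j = 2, k = 2, l = 5):

j, k, l = 2, 2, 5
print((2 * k - 1) * j * l)  # 30, the textbook count
print(2 * j * k * l)        # 40, what the profiler reports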
