find the previous version of this document at crankshaft/performance-profiling.md
- General Strategies to track and improve Performance
- Node.js Perf And Tooling
- Inspecting V8
- Considerations when Improving Performance
watch | slide | watch profiling workflow
Analyze performance only once you have a problem, and do so in a top-down manner like so:
- ensure it's JavaScript and not the DOM
- reduce the testcase to pure JavaScript and run it in the `d8` shell
- collect metrics and locate bottlenecks
- sample profiling to narrow down the general problem area
- at this point think about the algorithm, data structures, techniques, etc. used in this area and evaluate if improvements are possible, since those will most likely yield greater impact than any of the more fine-grained improvements
- structural profiling to isolate the exact area i.e. function in which most time is spent
- evaluate what can be improved here again thinking about algorithm first
- only once algorithm and data structures seem optimal, evaluate how the code structure affects the assembly code generated by V8 and possible optimizations (small functions, `try/catch`, closures, loops vs. `forEach`, etc.)
- optimize slowest section of code and repeat structural profiling
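The "reduce testcase and collect metrics" steps above can be sketched with a minimal, dependency-free harness that runs unchanged in both `d8` and Node.js. `hotFunction` is a hypothetical stand-in for the code under investigation.

```javascript
// `d8` exposes print(), Node.js exposes console.log — pick whichever exists
const log = typeof print === 'function' ? print : console.log;

// stand-in workload for the code being profiled
function hotFunction (n) {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += i * i;
  return sum;
}

function bench (fn, arg, iterations) {
  // warm up so optimized code is measured, not compilation
  for (let i = 0; i < 100; i++) fn(arg);
  const start = Date.now();
  let result;
  for (let i = 0; i < iterations; i++) result = fn(arg);
  return { ms: Date.now() - start, result };
}

const { ms, result } = bench(hotFunction, 1e4, 1e3);
log('took ' + ms + 'ms, result ' + result);
```

Keeping the testcase this small makes it easy to re-run under the V8 flags discussed below.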
- at a fixed frequency the program is instantaneously paused (by setting the stack size to 0) and the call stack is sampled
- assumes that the sample is representative of workload
- gives no sense of flow due to gaps between samples
- functions that were inlined by the compiler aren't shown
- collect data for longer period of time, sampling every 1ms
- ensure code is exercising the right code paths
- functions are instrumented to record entry and exit times
- three data points per function
- Inclusive Time: time spent in function including its children
- Exclusive Time: time spent in function excluding its children
- Call Count: number of times the function was called
- data points are taken at much higher frequency than sampling
- higher cost than sampling due to instrumentation
- goal of optimization is to minimize inclusive time
- inlined functions retain markers
- think about data being processed
- is one piece of data slower?
- name time ranges based on data
- use variables/properties to dynamically name ranges
|                              | Sampling    | Structural / Instrumenting |
|------------------------------|-------------|----------------------------|
| Time                         | Approximate | Exact                      |
| Invocation count             | Approximate | Exact                      |
| Overhead                     | Small       | High(er)                   |
| Accuracy                     | Good - Poor | Good - Poor                |
| Extra code / instrumentation | No          | Yes                        |
- need both
- manual instrumentation can reduce overhead
- instrumentation affects performance and may affect behavior
- sample counts are very accurate, but extracting exact timings from them is inaccurate
- sampling requires no program modification
- each module of the app should have a time budget
- the sum of module budgets should stay below `16ms` for smooth client side apps
- track performance daily or per commit in order to catch budget busters right away
- queue up key handlers and execute inside Animation Frame
- optimize for lowest common denominator that your app will run on
- for mobile stay below `8-10ms` since the remaining frame time is needed for Chrome to do its work, i.e. render
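The "queue up key handlers and execute inside Animation Frame" advice above can be sketched as follows: key events only enqueue, and a single per-frame flush drains the queue, so handler work runs at most once per frame. `handleKey` is a hypothetical handler; the flush is called manually here so the sketch also runs outside a browser.

```javascript
const queue = [];
const handled = [];

// cheap: just record the event, do no work in the handler itself
function onKey (event) {
  queue.push(event);
}

// the actual (potentially expensive) handler work
function handleKey (event) {
  handled.push(event.key);
}

// drains all queued events in one batch
function flushKeys () {
  const batch = queue.splice(0);
  for (const e of batch) handleKey(e);
}

// in a browser you would schedule flushKeys once per frame:
// (function frame () { flushKeys(); requestAnimationFrame(frame); })();
onKey({ key: 'a' });
onKey({ key: 'b' });
flushKeys(); // both events coalesced into a single frame's flush
```

Batching this way keeps handler cost inside the frame budget instead of interleaving it with rendering.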
- Node.js is inspectable via the `--inspect` and similar flags
- multiple tools, like DevTools (chrome://inspect) and VS Code, integrate with it to allow debugging and profiling Node.js applications
- DevTools includes dedicated Node.js window that auto connects to any Node.js process that is launched with the debugger enabled
- inline breakpoints allow breaking on a specific statement on a line with multiple statements
- async code flow debugging is supported (async stack traces)
- launch your application with `node --inspect app.js`
- open dedicated Node.js DevTools via chrome://inspect or by clicking the Node.js icon in the upper left of any DevTools window
- select the profiler tab and click Start while loading your application and Stop when done
- inspect the profile via the Chart view to see the timeline of function execution
- select the Heavy (Bottom Up) view to see functions ordered by aggregated time spent executing them
- in both views `Cmd|Ctrl - F` allows searching for specific functions, i.e. searching for `garbage` will highlight all instances in which garbage was collected
- the time spent in `(program)` frames is either spent in native code or idling; if many of those frames show up it is advised to profile either via `node --prof` or kernel tracing tools like perf or dtrace in conjunction with flamegraph visualizers
- takes perf or dtrace results as input and transforms them into flamegraphs in three steps using the included Perl scripts
- when using perf and the Node.js process was run with `--perf-basic-prof`, it is able to resolve JavaScript symbols
- thlorenz/flamegraph is a port of those scripts to a JavaScript module and application to make them easier to use, with the added benefit of resolving JavaScript symbols when using dtrace as well
- 0x took ideas from both of these tools to provide a very easy to use solution to profile Node.js applications
- 0x combines all steps to generate flamegraphs into an easy to use package and is the recommended solution to produce flamegraphs for Node.js applications running locally
- by default it adds the `--prof` flag when running your Node.js application and includes JavaScript as well as native symbols for V8 and Node.js, selectable in the toolbar of the visualization (V8 and cpp)
- for cases in which lower level inspection is required, the command line API includes a `--kernel-tracing` flag which will use perf or dtrace (depending on OS) to profile your application and produce a flamegraph from its output
- an interactive demo can be found here
read | read very comprehensive resource on perf with tons of examples
Only available on Linux.
perf record -g node --perf-basic-prof app.js
perf report ## add --no-children to avoid expanding nodes
- visualize `perf` output via Brendan Gregg's Flamegraph Tool
- 0x uses it under the hood to produce input when `--kernel-tracing` is enabled when profiling Node.js applications on Linux
Only available on Solaris and BSD-derived systems like macOS.
- since dtrace accepts a full script, there are infinite ways of profiling applications with it
- here is an example script that produces data that can become input to tools to produce flamegraphs
- 0x uses it under the hood to produce input when `--kernel-tracing` is enabled when profiling Node.js applications on BSD Unixes
- here is an example to do this manually for Node.js applications
Multiple flags and so-called runtime functions are available to anyone who likes to peek into the inner workings of V8.
- `--print-ast` prints the AST generated by V8 to the console
- `--print-bytecode` prints bytecode generated by the ignition interpreter to the console
  - provides more info when run with a debug build of d8, i.e. information about maps created
- `--trace-ic` dumps IC traces to the console
  - pipe that output into `./v8/tools/ic-processor` to visualize it
- `--print-opt-code` prints the actual optimized code that is generated by TurboFan
- `--code-comments` adds comments to the printed optimized code
- `--trace-opt` traces lazy optimization
  - generic ICs are bad: if lots of them are present, the code will not be optimized
  - ICs with typeinfo are good
- `--trace-maps` in combination with `--trace-maps-details` traces map generation into v8.log
  - `--expose-gc` allows forcing GC via `gc()` in your code to see which maps are short lived
  - the output can be parsed to see what maps (hidden classes) and transitions into other maps V8 creates to represent your objects
  - a graphical presentation is also available by loading the resulting `v8.log` into `/v8/tools/map-processor.htm` (requires V8 checkout)
- `--runtime-call-stats` dumps statistics about the V8 runtime to the console
  - these stats give detailed info about where V8 spends its time
Sample Output (abbreviated)
Runtime Function/C++ Builtin Time Count
========================================================================================
JS_Execution 8.82ms 47.24% 1 0.11%
RecompileSynchronous 3.89ms 20.83% 7 0.75%
API_Context_New 2.20ms 11.78% 1 0.11%
GC_SCAVENGER_SCAVENGE 0.88ms 4.73% 15 1.60%
AllocateInNewSpace 0.51ms 2.71% 71 7.59%
GC_SCAVENGER_SCAVENGE_ROOTS 0.38ms 2.04% 15 1.60%
GC_SCAVENGER_SCAVENGE_PARALLEL 0.24ms 1.29% 15 1.60%
GC_SCAVENGER_BACKGROUND_SCAVENGE_PARALLEL 0.18ms 0.94% 15 1.60%
GC_Custom_SlowAllocateRaw 0.15ms 0.79% 6 0.64%
CompileForOnStackReplacement 0.13ms 0.70% 2 0.21%
ParseProgram 0.13ms 0.70% 1 0.11%
CompileIgnition 0.13ms 0.69% 4 0.43%
PreParseNoVariableResolution 0.11ms 0.59% 3 0.32%
OptimizeCode 0.09ms 0.46% 5 0.53%
CompileScript 0.08ms 0.41% 1 0.11%
Map_TransitionToDataProperty 0.07ms 0.40% 92 9.84%
InterpreterDeserializeLazy 0.07ms 0.36% 28 2.99%
Map_SetPrototype 0.06ms 0.29% 249 26.63%
FunctionCallback 0.05ms 0.27% 3 0.32%
ParseFunctionLiteral 0.04ms 0.22% 3 0.32%
GC_HEAP_EPILOGUE 0.04ms 0.19% 15 1.60%
[ ... ... ... ... ... ]
----------------------------------------------------------------------------------------
Total 18.67ms 100.00% 935 100.00%
- V8 heap statistics feature provides insight into both the V8 managed heap and the C++ heap
- `--trace-gc-object-stats` dumps memory-related statistics to the console
  - this data can be visualized via the V8 heap visualizer
  - make sure to not log to stdout when generating the `v8.gc_stats` file
  - NOTE: when I tried this tool by loading a `v8.gc_stats` generated via `node --trace-gc-object-stats script.js > v8.gc_stats` it errored; serving V8's `./tools/heap-stats` locally had the same result
- enable native functions via `--allow-natives-syntax`
- then use `%DebugPrint(array)` to dump information about this array to the console
- the `elements` field will hold information about the elements kinds of the array
Sample Output (abbreviated)
DebugPrint: 0x1fbbad30fd71: [JSArray]
- map = 0x10a6f8a038b1 [FastProperties]
- prototype = 0x1212bb687ec1
- elements = 0x1fbbad30fd19 <FixedArray[3]> [PACKED_SMI_ELEMENTS (COW)]
- length = 3
- properties = 0x219eb0702241 <FixedArray[0]> {
#length: 0x219eb0764ac9 <AccessorInfo> (const accessor descriptor)
}
- elements= 0x1fbbad30fd19 <FixedArray[3]> {
0: 1
1: 2
2: 3
}
[…]
--trace-elements-transitions
dumps elements transitions taking place to the console
Sample Output
elements transition [PACKED_SMI_ELEMENTS -> PACKED_DOUBLE_ELEMENTS]
in ~+34 at x.js:2 for 0x1df87228c911 <JSArray[3]>
from 0x1df87228c889 <FixedArray[3]> to 0x1df87228c941 <FixedDoubleArray[22]>
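A transition like the one in the sample output can be provoked from plain JavaScript: writing a double into an array of small integers changes its backing store. Run the snippet with `node --trace-elements-transitions` (or inspect the array via `%DebugPrint` under `--allow-natives-syntax`) to observe it.

```javascript
const a = [1, 2, 3]; // starts as PACKED_SMI_ELEMENTS
a[0] = 4.5;          // elements transition to PACKED_DOUBLE_ELEMENTS
a.push('x');         // further transition to PACKED_ELEMENTS
```

Transitions are one-way toward more generic kinds, which is why mixing element types in hot arrays costs performance.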
- when Node.js or d8 is run with the `--trace-turbo` flag it outputs one `turbo-*.json` file per function
- each JSON file includes information about the optimized code along the various phases of TurboFan's optimization pipeline
- turbolizer is a tool derived from the turbolizer application included with V8
- it visualizes the TurboFan optimization pipeline information and provides easy navigation between source code, Turbofan IR graphs, scheduled IR nodes and generated assembly code
Three groups of optimizations exist: algorithmic improvements, workarounds for JavaScript limitations, and workarounds for V8 related issues.
As we have shown, V8 related issues have decreased immensely and should be reported to the V8 team if found; however, in some cases workarounds are needed.
Before applying any optimizations, first profile your app and understand the underlying problem, then apply changes and prove by measuring that they change things for the better.
- different performance problems call for different approaches to profile and visualize the cause
- learn to use different profilers, including low level profilers like `perf`
- before applying micro optimizations to your code reason about its abstract complexity
- evaluate how your code would be used on average and in the worst case and make sure your algorithm handles both cases in a performant manner
- prefer monomorphism in very hot code paths if possible, as polymorphic functions cannot be optimized to the extent that monomorphic ones can
- measure that strategies like caching and memoization actually result in performance improvements before applying them, as in some cases the cache lookup may be more expensive than performing the computation
- understand the limitations and costs of V8, a garbage collected system, in order to choose appropriate data types to improve performance, i.e. prefer a `Uint8Array` over a `String` when it makes sense