Inspection and Performance Profiling

find the previous version of this document at crankshaft/performance-profiling.md

General Strategies to track and improve Performance

Identify and Understand Performance Problem

watch | slide | watch profiling workflow

Analyse performance only once you have an actual problem, working top down like so:

  • ensure it's JavaScript and not the DOM
  • reduce testcase to pure JavaScript and run in d8 shell
  • collect metrics and locate bottlenecks
  • sample profiling to narrow down the general problem area
    • at this point think about the algorithm, data structures, techniques, etc. used in this area and evaluate if improvements are possible, since that will most likely yield greater impact than any of the more fine-grained improvements
  • structural profiling to isolate the exact area i.e. function in which most time is spent
    • evaluate what can be improved here again thinking about algorithm first
    • only once algorithm and data structures seem optimal evaluate how the code structure affects assembly code generated by V8 and possible optimizations (small functions, try/catch, closures, loops vs. forEach, etc.)
  • optimize slowest section of code and repeat structural profiling

Sampling CPU Profilers

watch | watch walkthrough

  • at a fixed frequency the program is instantaneously paused (internally, by setting the stack limit to 0) and the call stack is sampled
  • assumes that the sample is representative of workload
  • gives no sense of flow due to gaps between samples
  • functions that were inlined by compiler aren't shown
  • collect data over a longer period of time, sampling every 1ms
  • ensure code is exercising the right code paths

Structural CPU Profilers

watch | watch walkthrough

  • functions are instrumented to record entry and exit times
  • three data points per function
    • Inclusive Time: time spent in function including its children
    • Exclusive Time: time spent in function excluding its children
    • Call Count: number of times the function was called
  • data points are taken at much higher frequency than sampling
  • higher cost than sampling due to instrumentation
  • goal of optimization is to minimize inclusive time
  • inlined functions retain markers

Instrumentation Techniques

watch

  • think about data being processed
    • is one piece of data slower?
  • name time ranges based on data
    • use variables/properties to dynamically name ranges

Instrumenting vs. Sampling

watch

|                              | Sampling    | Structural / Instrumenting |
| ---------------------------- | ----------- | -------------------------- |
| Time                         | Approximate | Exact                      |
| Invocation count             | Approximate | Exact                      |
| Overhead                     | Small       | High(er)                   |
| Accuracy                     | Good - Poor | Good - Poor                |
| Extra code / instrumentation | No          | Yes                        |

  • need both
  • manual instrumentation can reduce overhead
  • instrumentation affects performance and may affect behavior
  • individual samples are very accurate, but extracting exact timings from them is not
  • sampling requires no program modification

Plan for Performance

watch

  • each module of the app should have a time budget
  • sum of modules should be < 16ms for smooth client side apps
  • track performance daily or per commit in order to catch budget busters right away
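The budget idea can be sketched as follows; the module names and millisecond numbers are hypothetical, only the 16ms frame total comes from the text above:

```js
// Per-module time budgets (ms); their sum must stay under the ~16ms
// available per frame for a smooth client side app.
const budgets = { input: 2, physics: 4, render: 8 }; // hypothetical split

const total = Object.values(budgets).reduce((a, b) => a + b, 0);
if (total >= 16) throw new Error(`budgets sum to ${total}ms, over the 16ms frame`);

// A daily/per-commit perf run can call this to flag budget busters right away.
function checkBudget(module, actualMs) {
  return actualMs <= budgets[module];
}

console.log(checkBudget('physics', 3.5)); // true: within the 4ms budget
console.log(checkBudget('render', 9.2));  // false: busted the 8ms budget
```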

Animation Frame

watch | watch walkthrough

  • queue up key handlers and execute inside Animation Frame
  • optimize for lowest common denominator that your app will run on
  • for mobile, stay below 8-10ms since the remaining time is needed for the browser to do its work, i.e. render
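A sketch of queueing key handlers and flushing them once per animation frame; `onKey` and `pending` are hypothetical names, and `requestAnimationFrame` is a browser API, so the `setTimeout` fallback exists only to make the sketch runnable under Node:

```js
// Queue input handlers and run them once per animation frame, instead
// of doing work directly in each (possibly very frequent) event.
const raf = typeof requestAnimationFrame === 'function'
  ? requestAnimationFrame
  : (cb) => setTimeout(cb, 16); // Node fallback for this sketch only

const pending = [];
let scheduled = false;

function onKey(handler) {
  pending.push(handler);
  if (!scheduled) {
    scheduled = true;
    raf(() => {
      scheduled = false;
      const batch = pending.splice(0); // drain the queue
      batch.forEach((h) => h());       // all handlers run in one frame
    });
  }
}
```

However many key events arrive between frames, the work happens at most once per frame, inside the frame budget.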

Node.js Perf And Tooling

watch

  • Node.js is inspectable via the --inspect and similar flags
  • multiple tools, like DevTools (chrome://inspect) and VS Code integrate with it to allow debugging and profiling Node.js applications
    • DevTools includes dedicated Node.js window that auto connects to any Node.js process that is launched with the debugger enabled
    • inline breakpoints allow breaking on a specific statement on a line with multiple statements
    • async code flow debugging is supported (async stack traces)

Profile JavaScript CPU Usage of Node.js via Chrome DevTools

read | read

  • launch your application with node --inspect app.js
  • open dedicated Node.js DevTools via chrome://inspect or by clicking the Node.js icon in the upper left of any DevTools window
  • select the profiler tab and click Start while loading your application and Stop when done
  • inspect the profile via the Chart view to see the timeline of function execution
  • select the Heavy (Bottom Up) view to see functions ordered by aggregated time spent executing them
  • in both views Cmd|Ctrl - F allows searching for specific functions, i.e. searching for garbage will highlight all instances in which garbage was collected
  • the time spent in (program) frames is spent either in native code or idling; if many such frames appear, it is advised to profile via node --prof or kernel tracing tools like perf or dtrace in conjunction with flamegraph visualizers
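To practice the steps above, any app with an obvious hot spot will do; this is a hypothetical example app, not from the source:

```js
// app.js: a deliberately hot function. Launch with `node --inspect-brk app.js`,
// open the dedicated Node.js DevTools, start the profiler and resume:
// hotPath should dominate the Heavy (Bottom Up) view.
function hotPath(n) {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += Math.sqrt(i); // cheap work, many times
  return sum;
}

let result = 0;
for (let round = 0; round < 50; round++) result += hotPath(1e6);
console.log(result);
```

For longer profiling sessions, wrapping the `hotPath` calls in a `setInterval` keeps the process alive while you attach.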


Resources

Profile FullStack CPU Usage of Node.js

Tools To Produce Full Stack Flamegraphs

Brendan Gregg's Flamegraph Tool

read

  • takes perf or dtrace results as input and transforms them into flamegraphs in three steps using the included Perl scripts
  • when using perf and the Node.js process was run with --perf-basic-prof it is able to resolve JavaScript symbols
  • thlorenz/flamegraph is a port of those scripts to a JavaScript module and application to make it easier to use with the added benefit of resolving JavaScript symbols when using dtrace as well
  • 0x took ideas from both of these tools to provide a very easy to use solution to profile Node.js applications

0x

  • 0x combines all steps to generate flamegraphs into an easy to use package and is the recommended solution to produce flamegraphs for Node.js applications running locally
  • by default it adds the --prof flag when running your Node.js application and includes JavaScript as well as native symbols for V8 and Node.js, selectable in the toolbar of the visualization (V8 and cpp)
  • for cases in which lower level inspection is required, the command line api includes a --kernel-tracing flag which will use perf or dtrace (depending on OS) to profile your application and produce a flamegraph from its output
  • an interactive demo can be found here

Perf

read | read (a very comprehensive resource on perf with tons of examples)

Only available on Linux.

```sh
perf record -g node --perf-basic-prof app.js
perf report # add --no-children to avoid expanding nodes
```

  • visualize perf output via Brendan Gregg's Flamegraph Tool
  • 0x uses it under the hood to produce input when --kernel-tracing is enabled when profiling Node.js applications on Linux

Dtrace

read

Only available on BSD Unix, like Mac OSX or Solaris.

  • since dtrace accepts a full script, there are infinite ways of profiling applications with it
  • here is an example script that produces data that can become input to tools to produce flamegraphs
  • 0x uses it under the hood to produce input when --kernel-tracing is enabled when profiling Node.js applications on BSD Unixes
  • here is an example to do this manually for Node.js applications

Inspecting V8

V8 flags

Multiple flags and so-called runtime functions are available to anyone who'd like to peek into the inner workings of V8.

AST

  • --print-ast prints the AST generated by V8 to the console

Byte Code

  • --print-bytecode prints the bytecode generated by the Ignition interpreter to the console
  • provides more info when run with a debug build of d8, i.e. information about maps created

Tracing Inline Caches

  • --trace-ic dumps IC traces to the console
  • pipe that output into ./v8/tools/ic-processor to visualize it

Optimized Code

  • --print-opt-code prints the actual optimized code that is generated by TurboFan
  • --code-comments adds comments to printed optimized code

Tracing Optimizations

  • --trace-opt traces lazy optimization
    • generic ICs are bad; if lots of them are present, the code will not be optimized
    • ICs with typeinfo are good
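The IC states being traced can be provoked deliberately; this is a hypothetical sketch (function and variable names invented here) of what monomorphic vs. polymorphic call sites look like:

```js
// Run under `node --trace-ic` (or d8) to watch the IC state transitions.
function getX(obj) {
  return obj.x; // this property load has its own inline cache
}

// monomorphic: every call sees objects with the same hidden class
for (let i = 0; i < 1000; i++) getX({ x: i });

function getY(obj) {
  return obj.y;
}

// polymorphic: several different shapes hit one call site; with enough
// shapes the IC goes generic, which blocks optimization
const shapes = [
  { y: 1 }, { y: 1, a: 2 }, { y: 1, b: 2 }, { y: 1, c: 2 }, { y: 1, d: 2 },
];
let sum = 0;
for (let i = 0; i < 1000; i++) sum += getY(shapes[i % shapes.length]);
console.log(sum); // 1000
```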

Tracing Map Creation

  • --trace-maps in combination with --trace-maps-details trace map generation into v8.log
  • --expose-gc allows forcing GC via gc() in your code to see which maps are short lived
  • the output can be parsed to see which maps (hidden classes) V8 creates to represent your objects and how they transition into other maps
  • a graphical presentation is also available by loading the resulting v8.log into /v8/tools/map-processor.htm (requires V8 checkout)
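A sketch of the kind of map transitions --trace-maps makes visible; `makeA`/`makeB` are hypothetical names, and the map labels in the comments are illustrative, not actual trace output:

```js
// Objects built with the same property order share one map (hidden class);
// a different order creates a separate transition chain and extra maps.
function makeA() {
  const o = {};
  o.x = 1; // map M0 -> M1 (adds x)
  o.y = 2; // map M1 -> M2 (adds y)
  return o;
}

function makeB() {
  const o = {};
  o.y = 2; // different order: a different transition chain,
  o.x = 1; // so these objects end up with a different map than makeA's
  return o;
}

const a = makeA();
const b = makeB();
// Same own properties and values, but internally different maps,
// visible in the insertion order JSON.stringify preserves:
console.log(JSON.stringify(a)); // {"x":1,"y":2}
console.log(JSON.stringify(b)); // {"y":2,"x":1}
```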
Resources

Runtime Call Stats

  • --runtime-call-stats dumps statistics about the V8 runtime to the console
  • these stats give detailed info where V8 time is spent

Sample Output (abbreviated)


```
                      Runtime Function/C++ Builtin        Time             Count
========================================================================================
                                      JS_Execution      8.82ms  47.24%         1   0.11%
                              RecompileSynchronous      3.89ms  20.83%         7   0.75%
                                   API_Context_New      2.20ms  11.78%         1   0.11%
                             GC_SCAVENGER_SCAVENGE      0.88ms   4.73%        15   1.60%
                                AllocateInNewSpace      0.51ms   2.71%        71   7.59%
                       GC_SCAVENGER_SCAVENGE_ROOTS      0.38ms   2.04%        15   1.60%
                    GC_SCAVENGER_SCAVENGE_PARALLEL      0.24ms   1.29%        15   1.60%
         GC_SCAVENGER_BACKGROUND_SCAVENGE_PARALLEL      0.18ms   0.94%        15   1.60%
                         GC_Custom_SlowAllocateRaw      0.15ms   0.79%         6   0.64%
                      CompileForOnStackReplacement      0.13ms   0.70%         2   0.21%
                                      ParseProgram      0.13ms   0.70%         1   0.11%
                                   CompileIgnition      0.13ms   0.69%         4   0.43%
                      PreParseNoVariableResolution      0.11ms   0.59%         3   0.32%
                                      OptimizeCode      0.09ms   0.46%         5   0.53%
                                     CompileScript      0.08ms   0.41%         1   0.11%
                      Map_TransitionToDataProperty      0.07ms   0.40%        92   9.84%
                        InterpreterDeserializeLazy      0.07ms   0.36%        28   2.99%
                                  Map_SetPrototype      0.06ms   0.29%       249  26.63%
                                  FunctionCallback      0.05ms   0.27%         3   0.32%
                              ParseFunctionLiteral      0.04ms   0.22%         3   0.32%
                                  GC_HEAP_EPILOGUE      0.04ms   0.19%        15   1.60%
                                [ ...                   ...     ...         ...    ... ]
----------------------------------------------------------------------------------------
                                             Total     18.67ms 100.00%       935 100.00%
```

Resources

Memory Visualization

  • V8 heap statistics feature provides insight into both the V8 managed heap and the C++ heap
  • --trace-gc-object-stats dumps memory-related statistics to the console
  • this data can be visualized via the V8 heap visualizer
    • make sure to not log to stdout when generating the v8.gc_stats file
    • NOTE: when I tried this tool by loading a v8.gc_stats generated via node --trace-gc-object-stats script.js > v8.gc_stats it errored
    • serving V8's ./tools/heap-stats locally had the same result
Resources

Array Elements Kinds

  • enable native functions via --allow-natives-syntax
  • then use %DebugPrint(array) to dump information about this array to the console
  • the elements field will hold information about the elements kinds of the array
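The elements kinds named in the sample output below can be walked through deliberately; the kinds in the comments reflect V8's documented transitions, while the snippet itself is an invented example:

```js
// Run under `node --allow-natives-syntax` and add %DebugPrint(arr)
// after each step to see the kind change in the elements field.
const arr = [1, 2, 3];   // PACKED_SMI_ELEMENTS: small integers only
arr.push(4.5);           // PACKED_DOUBLE_ELEMENTS: now holds a double
arr.push('five');        // PACKED_ELEMENTS: now holds a non-number
// Transitions only go in one direction: the array never returns to
// SMI elements, and creating a hole (e.g. arr[10] = 1) would make
// the kind HOLEY_* instead of PACKED_*.
console.log(arr.length); // 5
```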

Sample Output (abbreviated)

```
DebugPrint: 0x1fbbad30fd71: [JSArray]
 - map = 0x10a6f8a038b1 [FastProperties]
 - prototype = 0x1212bb687ec1
 - elements = 0x1fbbad30fd19 <FixedArray[3]> [PACKED_SMI_ELEMENTS (COW)]
 - length = 3
 - properties = 0x219eb0702241 <FixedArray[0]> {
    #length: 0x219eb0764ac9 <AccessorInfo> (const accessor descriptor)
 }
 - elements= 0x1fbbad30fd19 <FixedArray[3]> {
           0: 1
           1: 2
           2: 3
 }
[…]
```

  • --trace-elements-transitions dumps elements transitions taking place to the console

Sample Output

```
elements transition [PACKED_SMI_ELEMENTS -> PACKED_DOUBLE_ELEMENTS]
  in ~+34 at x.js:2 for 0x1df87228c911 <JSArray[3]>
  from 0x1df87228c889 <FixedArray[3]> to 0x1df87228c941 <FixedDoubleArray[22]>
```

Resources

Tools to Inspect/Visualize V8 Operations

Turbolizer

  • when Node.js or d8 is run with the --trace-turbo flag, it outputs one turbo-*.json file per function
  • each JSON file includes information about optimized code along the various phases of Turbofan's optimization pipeline
  • turbolizer is a tool derived from the turbolizer application included with V8
  • it visualizes the TurboFan optimization pipeline information and provides easy navigation between source code, Turbofan IR graphs, scheduled IR nodes and generated assembly code

Considerations when Improving Performance

Three groups of optimizations are algorithmic improvements, workarounds for JavaScript limitations, and workarounds for V8-related issues.

As we have shown, V8-related issues have decreased immensely and should be reported to the V8 team if found; however, in some cases workarounds are needed.

However, before applying any optimization, first profile your app and understand the underlying problem; then apply changes and prove by measuring that they change things for the better.

Profilers

  • different performance problems call for different approaches to profile and visualize the cause
  • learn to use different profilers including low level profilers like perf

Tweaking hot Code

  • before applying micro optimizations to your code reason about its abstract complexity
  • evaluate how your code would be used on average and in the worst case and make sure your algorithm handles both cases in a performant manner
  • prefer monomorphism in very hot code paths if possible, as polymorphic functions cannot be optimized to the extent that monomorphic ones can
  • measure that strategies like caching and memoization actually result in performance improvements before applying them, as in some cases the cache lookup may be more expensive than performing the computation
  • understand limitations and costs of V8, a garbage collected system, in order to choose appropriate data types to improve performance, i.e. prefer a Uint8Array over a String when it makes sense

Resources