Profiling

Games and apps must run well to deliver acceptable experiences. When performance issues are encountered, optimization is required. The best way to improve a bad performance scenario is a simple loop: measure, optimize, repeat until performance is better. Measuring performance to identify what is slowing down your app helps you direct your development effort for maximum return - that is, fix the slowest part first!

How to measure? Loom provides a rich set of profiling tools. They are simple but effective.

Let's begin by profiling something. First, start your application normally with:

$ loom run

then at the command console enter:

$ profilerEnable

You should then see:

[asset.protocol] LOG: Enabling profiler...

This means the profiler is running and generating metrics for both object allocation/deallocation and timings on methods being called.

Now, do something meaningful in your application. In terms of looking for memory leaks, this might mean cycling between screens or playing a couple rounds of a game - anything that exhibits subpar performance. Once you have done this enter the following into the command console:

$ profilerDump

You will be rewarded with a huge amount of information on what Loom was doing while you ran the profiler. The following sections explain what this output means and how to use it.

Advanced Note: You may drive the profiler programmatically via the system.Profiler API. profilerReset will reset all counts in the profiler.

Telemetry

Loom provides a telemetry server, which streams a variety of useful timings and other metrics from the app to the Loom console. It provides a visualization of the underlying runtime systems in real-time.

You can enable or disable it by typing .telemetry in the Loom console or by setting the telemetry config value to true (loom config telemetry true on the command line).

If you can see that the telemetry server is running and that telemetry has been enabled in the console after running with loom run, you should be all set. Just open up your preferred browser at http://localhost:8073/ and you should see the Loom Telemetry interface.

The main chart displays a real-time view into the runtime ticks being executed. Scroll and drag on the x axis where the tick numbers are shown to see more or fewer ticks. Scroll on the bars themselves to resize the tick vertically and see different parts in more detail, drag it up and down to bring them into view.

Zoom into a specific tick far enough to see a more detailed breakdown of the time different subsystems take. Within a single tick, the different timing sections are arranged in a hierarchy from left to right with the vertical position and size representing the start time relative to the tick and the duration of the section.

The telemetry system works whether the profiler is enabled or disabled. If the profiler is disabled, it only shows the native timing blocks, if it is enabled, it shows the timings of all the LoomScript functions called as well.

Tracking Performance

Loom tracks detailed timing information on both the script and native sides to give you a complete view on performance. In the output from the profiler, you will see two reports - the "Ordered by non-sub total time" and the "Ordered by strack trace total time" reports.

The first, the non-sub time report, shows the time spent in a function less the time spent in its children - its "non-sub" time. In other words, this report shows the functions taking up the most execution time globally. Suppose you have a function A that has little code, and does its work by calling another function B many times. A will show up low on this report, while B will be near the top. Here's some example output:

 Ordered by non-sub total time -
 %NSTime  % Time  Invoke #  Name
  30.436  81.067      590 loom.Application.onInternalFrame
   4.133   5.762       93 loom2d.display.DisplayObjectContainer.getChildEventListeners
   3.782   7.344       97 loom2d.text.BitmapFont.arrangeChars
   3.716  41.502      961 loom2d.events.EventDispatcher.invokeEvent
   2.703   2.706        1 loom.Application.onProfilerDump
   2.289   2.289     1560 system.Vector.Vector
   ... trimmed for length ...
 Suppressed 104 items with < 0.1% of measured time.

The above shows that 30% of all observed time was spent in Application.onInternalFrame! Why might this be? It's an internal function that's run on every frame. Because the profiled application was running very well, not very much time was measured and even a lightweight function like onInternalFrame will show up high in the profiler's report. Try running the profiler with a slower program and you'll rapidly see your code rise to the top!

However, while the above report gives us some good insights on the slowest parts, we don't see context. Suppose that some utility function - like VertexData.setPosition - was high on the non-sub time report. This function is hard to micro-optimize because all it does is shuffle a little data around. But it could be being used in a very non-optimal way - perhaps a custom UI component is updating itself too often. So we want to see when it is being called and why.

This is where the stack trace report comes in. It shows program execution as a tree. Each function's calls are shown as children in the tree, and the whole thing is sorted to show the slowest branches first. Here's a snippet from a real profiler dump:

 Ordered by stack trace total time -
 % Time  % NSTime  Invoke #  Name
 100.000 100.000        0 ROOT
  81.067  30.436      590   loom.Application.onInternalFrame
  39.646   1.615      590     display.Stage.advanceTime
  37.650   1.282      590       events.EventDispatcher.dispatchEvent
  35.739   2.608      590         events.EventDispatcher.invokeEvent
  30.664   0.554      590           feathers.core.ValidationQueue.process
  13.088   0.094       35             text.BitmapFontTextRenderer.validate
  12.584   0.373       35               text.BitmapFont.fillQuadBatch
   4.417   0.113      352                 display.QuadBatch.addImage
   4.304   0.778      352                   display.QuadBatch.addQuad
   3.526   0.243      352                     display.Image.updateVertexData
   2.823   1.015      352                       SubTexture.adjustVertexData
   0.794   0.568     1408                         utils.VertexData.getTexCoords
   0.226   0.226     1408                           utils.VertexData.getOffset
   0.717   0.494     1408                         utils.VertexData.setTexCoords
   0.223   0.223     1408                           utils.VertexData.getOffset
   0.225   0.225     1408                         math.Point.__op_assignment
   0.459   0.459      352                       utils.VertexData.copyTo
   2.848   0.618      352                 display.Image.__pset_texture
   1.673   0.420      299                   display.Image.readjustSize
   0.597   0.412     1196                     utils.VertexData.setPosition
   0.185   0.185     1196                       utils.VertexData.getOffset
   0.548   0.245      299                     Texture.__pget_frameReadOnly
   0.153   0.107      299                       SubTexture.__pget_width
   0.150   0.105      299                       SubTexture.__pget_height
   0.287   0.194      598                   SubTexture.__pget_premultipliedAlpha
   0.114   0.114      299                   textures.Texture.__pget_nativeID
   2.761   1.441       35                 text.BitmapFont.arrangeChars
   0.618   0.231      740                   text.BitmapChar.__pget_width
   0.387   0.265      740                     textures.SubTexture.__pget_width
   0.121   0.121      740                       ConcreteTexture.__pget_width
 ... trimmed for length, package names also reduced for width ...
 Suppressed 433 items with < 0.1% of measured time.

As you can see, a lot of code is showing up! But before we jump to conclusion, let's dig into the numbers.

First, it's often useful to look at the invoke counts. They give a picture of how often things are being run. We can spot frame counts by looking for our old friend onInternalFrame - the name suggests that it's called every frame and you'd be right to think that. For instance, we can see that onInternalFrame was run 590 times in the profiling period, but BitmapFontTextRenderer.validate was only run 35 times. So this branch of execution, which dominates the top of our profiler dump, was run less than 10% of frames - not so much of a concern now, is it?

We can also look at the non-sub and total time percents for the BitmapFontTextRenderer branch - 13% of total time is a fair chunk. However, before you rewrite the validate call, consider that the non-sub time (the time spent just in that function) is 0.09%. If you want any wins they'll have to be found deeper in the call stack.

So what do we know now? Well, we can see that for our profiling code not much code was run, and the profile is pretty flat - that is, most code paths take about the same amount of time. The only outlier is onInternalFrame, and this only appears to be taking a huge chunk because the app is doing so little - so doing anything takes a lot of time! In a real app or game, business logic, physics, graphics, or AI code would likely dominate the profiler.

Methods with __pget_ in them are getters, by the way, and __pset_ are setters. Functions starting with __op_ are operators.

Tracking Memory

If performance is time, memory is space. Naturally, Loom's profiler tracks both. In the profiler output, you will see a hierarchical list of objects created and destroyed since profiling began, broken out by the method doing the allocation.

For instance, here is one section showing a single object's allocation behavior (ignoring logger prefix output, and with package names trimmed for width):

    Alive: 14, Total: 45, Type: skins.SmartDisplayObjectStateValueSelector
         Alive 0, Total 4 (MetalWorksMobileTheme.buttonGroupButtonInitializer)
         Alive 4, Total 8 (MetalWorksMobileTheme.buttonInitializer)
         Alive 8, Total 31 (MetalWorksMobileTheme.itemRendererInitializer)
         Alive 1, Total 1 (MetalWorksMobileTheme.simpleButtonInitializer)
         Alive 1, Total 1 (MetalWorksMobileTheme.toggleSwitchTrackInitializer)

This is letting us know that since profiling began, 45 SmartDisplayObjectStateValueSelector were created and 14 are still instantiated (31 were created and garbage collected to arrive at the reported total of 45). We can also see which specific methods both created and destroyed the instances to find possible instance leaks. In this case, the itemRendererInitializer method is the largest source of "churn" or allocations and destructions.

If the number of "Alive" instances grows continuously, this is a good indication of a leak.

If you see methods with high churn, it usually means that you are creating temporary objects. If those same methods show up as hotspots in the profiler, or you see a lot of time spent in the GC, it's a huge sign that you should reuse objects instead of constantly recreating them.

: