Bjoern Knafla

Parallelization + AI + Gamedev Consulting 
Filed under

gamedev

 

Parallelization of Game AI - Presentation from the Paris Game AI Conference '09

In June 2009 I gave an overview and motivation for Parallelization of Game AI at the Paris Game AI Conference '09. The conference was a blast - great discussions with many interesting people and inspiring presentations - look out for the next installment, I will do so for sure!

Alex and Petra Champandard from AiGameDev.com, the main organizers and master-minds behind the Paris Game AI Conferences, just published the video and audio recordings of my presentation and put it online for AiGameDev.com premium members. Petra did a great job editing the recordings and put in a lot of hard work to synchronize my talk with the downloadable slides.

Video and slides at AiGameDev.com:

Slides with speaker notes can also be found at bjoernknafla.com:

Slides-movie and brief presentation recap

I find it extremely awful to watch and listen to myself / my presentation. I worked very hard on the talk and created slides that were full of animations to explain the presented parallelization concepts - making the slides a kind of script to follow during speaking. A script that needs accurate transitions and explanations under time pressure to stay inside the presentation time slot. This rigid structure to follow in combination with my nervousnes lead to lots of "ahs" and "errs" and let me struggle with my English speaking skills. To amend for th is, I created a quicktime movie from the slides showing the actual animations and added a coarse talk outline below. This in combination with the speaker note extended slides and the video recording at AiGameDev.com should give you much of the useful information of my talk.

(download)

Introduction

  • Why we need parallel programming: parallel hardware, performance.
  • Parallel programming isn't easy - but no rocket-science either:
    • parallel programming errors: race conditions, deadlocks,
    • performance problems: contention,
    • be aware that order of parallel instruction execution isn't deterministic.
  • Parallel hardware dictates how to program for performance:
    • maximize throughput - to fight memory latency,
    • maximize memory locality - to save bandwidth,
    • minimize (false) sharing - to prevent contention,
    • combined: balance maximization of locality vs. minimization of sharing.
  • Artificial intelligence (AI) model referenced in the talk: agent-based AI.

References:

Parallelization approaches

Asynchronous calls for pathfinding and other service calculations

  • Run computations asynchronously.
  • Better memory locality as async computations work on their own data.
  • Control iterations and when to merge result if determinism is important.
  • Uncontrollable scaling with raw threads (over-, under-subscription of hardware and performance penalties for numerous thread creation and destruction per main loop cycle, and context switching).
  • Take away: don't use raw threads but task pools or other constructs to express concurrency but not manage it by hand.

References:

  • James Reinders, Intel Threading Building Blocks, O'Reilly, 2007 (cross-platform C++ task pool by Intel)

Parallel agents

  • Unorganized parallel execution of agents needs synchronization of shared data.
  • Unstructured memory accesses and unorganized syncing:
    • saturates memory bandwidth,
    • is bad for data locality,
    • is not deterministic, and
    • potentially leads to contention and performance degradation.
  • Game entity management is ownership management (think netword and distributed programming).
  • Defer all non-read accesses: collect write requests and deterministically resolve conflicts (e.g. based on entity ids).
  • Identify data:
    • dependencies,
    • reading and writing access,
    • public data readable by all entities,
    • private data only accessible by entity owning it.
  • Execute entity updates in two stages:
    1. collection stage: all entities read public data (from last game world state) in parallel (without side effects),
    2. barrier synchronization waits on all agents to finish their collection stage, followed by the
    3. instance update stage: all entities update their private entity data in parallel.
  • Think about double buffering for public game/entity states.
  • Hint: for determinism and if needed use a random number generator per entity instance.
  • Stage internal: unstructured (read-)access to data can result in bad locality and eventually saturates the memory bus (bandwidth and latency problems).

Reference:

Data parallel systems

  • Each collection stage of parallel agents is different because of different agent state and surrounding game world.
  • Each agent works on different aspects of its behavior in individual order and duration: gameplay logic, sensing, pathfinding, terrain analysis, etc.
  • As already shown - this is bad for data locality and memory bandwidth usage.
  • Slice agents into aspects and group all aspects of the same type into systems.
  • Result: no single update function per agent but an update function per aspect/system.
  • Independent systems can run in parallel to each other.
  • Integration of agents aspects computations is defered based on the order of system execution per main loop cycle.
  • Systems can internally optimize their data: enables efficient parallel processing.
  • Systems run through stages:
    1. Collect data and update requests.
    2. Prepare parallel computation on data and update requests.
    3. Compute and transform aspect data in parallel (data-parallel computation).
    4. Publish results (pull, or push, or message passing, double buffering).
  • Systems enable lean data structures with good locality.
  • Data-parallel computations typically scale well with increasing processor core counts.
  • Order-irrelevant data processing enables determinism.
  • Systems might need internal low-level parallelism and synchronization.

References:

Conclusion

Know your parallel hardware, program for throughput, organize memory access, balance maximization of locality vs. minimization of sharing. Use the parallelization approaches as guides, and always remember: it is easy to introduce hard to find parallel errors - therefore: KISS - keep it short and simple.

References and further reading:

Outlook

I am planning a number of blog articles to delve more deeply into the parallelization of games and game AI, and to take a look at parallelization approaches I had to drop from the talk because of time constraints, like:

If you have got questions, feedback, criticism, then feel free to leave a comment or contact me via email or twitter. I am looking forward to hearing from you!

 

Bjoern

Loading mentions Retweet
Filed under  //   game ai   gamedev   parallelization  

Comments [4]

Parallelization in Games: Codemasters Colin McRae Dirt 2

I am always on the hunt for gamedev industry parallelization approaches deployed in computer and video games. Typically I scribble down some bullet points about game parallelization infos I find to use as a reference later on - so: why not share it on the web? Here we start:

Todays issue of the Intel Visual Adrenaline Magazine (issue 4, 2009, web, pdf) contains the article The Muddy Beauty of DiRT* 2*: Where rally realism meets the raw road by Jon Jordan about Codemasters Colin McRae DiRT 2 ralley/driving game with some bits of technical details provided by Codemasters Gareth Thomas (senior graphics programmer).

DIRT2_morocco_04aDIRT2_morocco_04a picture by Codemasters, on Flickr

Following is a short overview of the parallelization info revealed in the article:

  • Physics engine runs at minimum of 60 Hz - critical parts of it are simulated at 1,000 Hz
  • PlayStation 3 and Xbox 360 as lead platforms, single- and multi-core PCs are also supported
  • A tweaked version of Codemasters EGO game engine provides the main multi-threading and parallelization framework
  • Task parallelism: rendering happens on one core, game logic and physics on another/other core(s)
  • Tasks can be split off to other cores if available (It is unclear to me if the rendering, logic and physics tasks are meant here, or if smaller tasks from within these systems are described to allow scaling of performance or eye-candy with the number of cores)
  • Processes are parallelized depending on the number of cores and threads of the host platform
  • Worker maps specify how many threads the different subsystems use and the mapping between threads and actual cores
  • XML files describe the worker maps for different platforms and parallel hardware configurations based on profiling
  • In combination with different graphics resolutions for different systems, and enabling and disabling certain features like real-time environment maps on cars or dynamic shadows, worker maps enable hand-tuned scaling from lower-end to top-of-the-line machines
  • Occlusion culling is done via offline pre-computed potentially-visible-set tables and real-time lookup into these tables
  • At the beginning of a rendering frame visibility queries for main scene, shadows, and reflections are invoked on multiple threads. Results determine terrain and graphical object resultion/level-of detail and which hidden objects to cull

Summary and comments

Intel Visual Adrenaline Magazine articles typically don't delve deeply into technical details - and this article is no exception.

From the information that is revealed it looks like Codemasters EGO engine is a classical task parallel solution which models certain functional components of games (e.g. graphics, physics, game logic, etc.) as tasks. These tasks are statically mapped to specific threads/cores while satisfying inter-task (data-) dependencies. The mapping is controlled via XML files called worker maps.

I would have liked to get more information how the different tasks are distributed to cores and threads to learn if the parallelization approach described could scale to increasing core counts or where the scaling limits lie. The worker maps and the description of the task parallel system indicate a strong thinking in threads and a coarse-grained approach to parallelism (whole functional components/tasks are subject of threading). While this gains strong control of hardware-usage (which task to map to which core/thread) on todays multi-core machines it often also serverly limits scalability with increasing core counts.

In addition to the platform-static threading, other operations (e.g. visibility queries) can be invoked on available service threads. Depending on the am ount of operations/queries this can easily scale with core/thread count. The article doesn't explain if the queries are handled as jobs and entered into a job pool which then schedules them to available worker threads created at game start-time and destroyed at game-end-time, or if threads are created on the fly per query (which isn't advisable as thread creation and destruction is quite computationally expensive when done too often per main loop cycle).

I would be delighted if anyone with more legally shareable insight into the parallelization of Codemasters EGO engine could comment on the open questions and perhaps provide even more detailed insights into the technology behind the physics sim heavy and visually spectacular DiRT 2 game.

 

Bjoern

Loading mentions Retweet
Filed under  //   game   gamedev   parallelization  

Comments [0]