I promised myself ages ago I would write a blog about this, more for the PS2 community than anything else as this seems to be almost a dark art in terms of understanding how it works.
Some of you may be familiar with Path 3 masking, some of you may not. In any versions around 0.9.4 or older, have you ever played a game where say some of the floor textures have had writing on them or just looked like absolute crap? Well the likelihood is that was due to Path 3 Masking problems. For those not familiar, here is a picture example of persona 3 displaying Path 3 masking issues.
So, what is it exactly?
Path 3 masking is a method of synchronizing the order in which geometry data (polygons etc) and the textures that go on them are sent to the GS (Graphical Synthesizer). It was a pretty efficient method of transferring information, which took completely different path routes to make optimal use of the PS2 Memory BUS and used a stalling method to keep the texture and geometry lists in order instead of interrupts which took extra cpu time. This way, developers could queue up massive amounts of textures and stream them to the GS in time with the geometry data efficiently.
Here is a badly drawn diagram showing what the Mask does:
As you can see, the mask stops PATH3 from transferring while the VU is busy, then lets it go and almost immediately puts the mask back on, this ensures that at the end of the texture transfer, PATH3 stops again!
So what was the problem? Why did everything look like crap?
Truthfully? we never had proper control over these packets. The GIF packets can come through the PATH3 DMA channel lumped together, where in actual fact, we should have been stalling half way through it, but in the emulator, we didn't really care what these packets were, so we just threw it all at the GS and ignored any packet ends and this is what caused the Texture/Geometry lists to go out of sync, this is where the crap on the screen came from. As the emulator evolved it became progressively easier to fix this issue, although we had the theory down, developers would handle it in different ways with different timings for the masks, so it took a fair amount of testing and tweaking to get right.
Fortunately these days we have got this pretty much solved and we completely analyze the GIF Packets before sending them on their way, so we know where we can stop the transfers, however we do still get the odd time where this rears its ugly head, due to how we have to time everything perfectly, but due to the nature of emulators, timing is painful to get correct and will probably never be so. Generally however, using gamefixes such as "EE Timing Fix" can get around these issues, without compromising too much on stability.
Well if you've kept up with
pcsx2's SVN, you'll notice we
recently added a MTVU (Multi-Threaded microVU1) option, which runs VU1 on its own thread.
Getting pcsx2 to use more cores is something many people have asked for, and they wondered why we weren’t doing it. Some users would go as far as to flame us pcsx2 coders saying we didn’t have the skills to do it or would say some other nonsense.
In reality, there are plenty of reasons why it’s taken so long for us to code something like MTVU. The key reason is that emulation in general does not lend itself well to threading components. If you want to be safe and accurate at a hardware emulation level, you generally take the approach of running the core CPU of the system (the EE-core in ps2’s case), and then based off the cycles you ran the CPU for, you time the execution of all the other processing units in the system (DMAC (GIF, VIF, SIF, …), VU0/VU1, IPU, IOP, etc…).
If you try to naively thread the components in the above model, you’ll have 2 huge problems which may not be apparent to non-emu coders. One problem is that threading the components (say GIF and VU1) at the same time, will lead to very bad syncing errors if the GIF and VU1 need to communicate with each other (and they do). The other problem is that just running a component on a thread is actually going to be a slow-down unless the thread is doing a lot of work without having to sync with other components. As an example, both Jake Stine (Air) and I have in the past experimented with naively threading VU1, and in addition to getting unstable graphics and crashes, our attempts also ran slower.
So now you may be wondering, “If threading has all those problems how can you get a speedup?” Well we had to be smart about it, and figure out a way to make threading work based on the properties of the hardware. One thing we know is that for a component to even be worth-while to thread, it needs to have relatively low interaction with other components (which will limit the amount of syncing needed with other components; and therefore increase the thread throughput), and it also needs to be a component that is used frequently and is computationally expensive to emulate (or else there’s no point in threading it). Luckily the GS fit that bill very well (which gave us MTGS), and to a lesser extent VU1 does (giving us MTVU).
![[Image: attachment.php?aid=29491]](http://forums.pcsx2.net/attachment.php?aid=29491)
(Some key components in the ps2; these aren’t drawn to any relative scale)
In the above diagram, I have drawn some of the key components which must be considered when threading VU1 and the GS. The dark arrows show common data paths where data is transferred frequently between the components in games. The white arrow heads show data paths that are possible in games, but are rarely used due to various reasons. There are some more data paths and more components to the ps2, but they’re excluded for simplicity (the actual grouping of the components in the ps2 also is a bit different, but drawing them that way would complicate the diagram).
Just so you’re not left in the dark I’ll briefly explain what the above components are used for. The EE-core is the main CPU of the ps2 and mostly handles game logic; using the GIF FIFO and Path 3, games generally send image data (textures) to the GS (the gpu of the ps2). VU1 (Vector Unit 1) is used for coordinate, matrix, and vector calculations games need to do for 3d graphics; when it’s done calculating it sends the processed data over to the GS via GIF’s Path 1. The VIF1 unit is in charge of sending micro-programs to VU1, unpacking compressed data to VU1’s data memory, signaling VU1 to start executing its micro-program, and it also can send data to the GS via GIF’s Path 2. The GIF Unit is in charge of managing the Path 1/2/3 transfers that are being sent to the GS, and it does this according to path priority (Path 1 > Path 2 > Path 3) and some other specific rules.
If you notice in the diagram, the GS (Graphics Synthesizer) has data coming into it frequently, and rarely ever needs to send data back to other components. When threading the GS, what we essentially do is buffer GIF transfers that are being sent to the GS, and then once we have enough buffered data we kick-start the MTGS thread and have it start processing the GIF packets we previously buffered. While it’s processing this buffered data, the EE main thread is still running and more GIF transfers are being buffered so that MTGS can process them when it gets the chance…
The key to making something like this work from an emulation standpoint is that the main emulation thread thinks that nothing is being threaded at all; it sends data to the GIF Unit like normal, and this gets processed according to the timing information from the EE core. However that GIF transfer data is being buffered which allows the MTGS thread to process the buffered data in parallel with the main emulation thread. If you notice though this design means that the MTGS thread is actually running behind the main emulation thread in terms of the data it’s processing, because the buffered data is old data that the main emulation thread has generated. The only time this can be a problem, is if the main emulation thread ever needs to read back data from the MTGS thread. In this case what needs to happen is the main emulation thread needs to stall and wait until the MTGS thread is done processing. Once that is done, the MTGS thread should have up-to-date status information, and the main emulation thread can read back the appropriate data from the MTGS thread. This syncing is slow and if emulation needs to do it a lot it eliminates the speedup from threading (and can even be a big slowdown), however as noted in the above diagram, the GS rarely ever needs to transfer data back; so that means threading the GS is almost always a speedup for us.
Now that we understand the basics of how the MTGS thread works, we can move onto MTVU. Since the GS was already the best candidate for threading in a ps2 emulator, it meant that threading any other component would be even more of a challenge. From analyzing pcsx2’s activity and design, the next component that made sense to thread was VU1. It is relatively isolated compared to the other components, and it takes a lot of computation power to emulate. However, there are some big problems which must be solved to get a fast and working threaded VU1.
One of the problems deals with what was talked about in the beginning of this article, how typical emulation runs the different components based off the cycles of the main CPU. With the typical emulation approach we would run the EE for say 2048 cycles, and then we would run VU1 for its equivalent amount of cycles which would be 1024 vu cycles (VU1 runs at half the clock speed of the EE). This approach would be very problematic for us with threading, because it would mean VU1 doesn’t run long enough for threading to be a speedup. However we are very fortunate because on the PS2 games can’t be as cycle-critical as they are with older systems. Any well coded ps2 game has to have thread safe code and account for possible timing inconsistencies (if the game doesn’t do this, then it likely won’t be stable on the real console, or between minor variations of the console). This means we can generally get away with running VU1 programs disregarding EE timing completely, and pcsx2 has actually been using this approach for VU1 for years now (because even in a non-threaded design its faster to emulate this way). Instead of running VU1 for X amount of cycles based off EE cycles, we have code it so that VU1 executes its complete micro-program till it’s finished. This is great news for threading VU1 because it means that we can run VU1 on a thread regardless of any EE timing and it won’t be a problem. It also means that since VU1 has been running in its entirety this whole time, from the EE’s point of view it has always seen that VU1’s status is ‘finished’ (because it never gets a chance to read the status as ‘executing’). This means that you can actually queue multiple VU1 programs, and run them at a later time as far as the EE is concerned, because it doesn’t really care about VU1’s status. This is even more good news for threading since it means more through-put for the VU1 thread if it can be queuing and sequentially executing multiple VU1 programs (as opposed to running only one VU1 program while the EE waits on it to finish).
There is another big problem with VU1 threading however: the VU1 processor works closely with the GIF Unit, and the GIF Unit is timed and executed according to the main emulation thread. So that means sending data to the GIF Unit from the VU1 thread would be at totally incorrect times with respect to core emulation timing. Now our solution for this is a bit tricky. What we end up doing is assume every VU1 program will send data to the GIF Unit. So immediately when the main emulation thread finds out it needs to execute a VU1 program, it sends a fake GIF transfer packet to pcsx2’s GIF Unit (and immediately after starts executing the VU1 program on the MTVU thread). The GIF Unit then acts like normal on the main emulation thread with its path arbitration and eventually uploads this fake GIF packet to the MTGS thread. The MTGS thread acts like normal telling the GS plugin to process real Path 2/Path 3 GIF packets until it reaches the fake Path 1 packet. When it does this, the MTGS thread asks the MTVU thread if at least 1 VU1 program has finished since the uploading of the fake GIF packet. If it hasn’t, then the MTGS thread stalls until the MTVU thread can finish a VU1 program. If the MTVU thread has already finished a VU1 program, it signals the MTGS thread and then the MTGS thread wakes up and starts talking to pcsx2’s GIF Unit. At this point the GIF Unit should have a Path 1 packet pending which the MTGS thread can process, because while the MTGS thread was waiting, the MTVU thread had been communicating with the GIF Unit and uploading real Path 1 packet data asynchronously to a buffer. When the MTVU thread finishes the VU1 program, just before it signals the MTGS thread that it’s done, it talks to the GIF Unit and tells it that all the Path 1 data that was just transferred to it should be considered a full Path 1 packet (and if no data was written to it, consider it a full Path 1 packet with “size = 0”). So going back to the MTGS thread, at this point it requests a full Path 1 packet from the GIF Unit, and the GIF Unit gives it one of the Path 1 packets that the MTVU thread has finished. Then the MTGS thread tells the GS plugin to process this Path 1 packet, and the cycle just continues over and over every time a VU1 program is run.
The only flaw in the above solution for GIF processing is that in reality pcsx2 needs to do some additional processing of GIF packet data on the main emulation thread to check if certain privileged GS registers are being written to (SIGNAL/FINISH/LABEL). With the approach described above, we first send fake GIF packet data that the GIF Unit processes instead of the real one; but if the game was really supposed to send a GIF packet which writes to the GS’s SIGNAL/FINISH/LABEL registers, then we don’t know about it and we’re out of luck. The game in this case will likely hang. There is no 100% reliable solution to this problem that still allows a fast threaded VU1, because in order to know for sure that those privileged registers will not be written, you need to execute the VU1 program and find out that it is not sending a GIF packet which writes to them. The best you can do is predict what the VU1 program will do based off what it did last time, but this is still not fully reliable, and a lot more complicated, it also will have a lot more overhead compared to the way we’re currently threading VU1, so in my opinion it’s not worth it. The good news is that it’s very rare for a game to write to the privileged registers from VU1’s GIF Path 1 (they usually do it from Path 2/Path 3), so that means that MTVU should be very compatible, however due to this potential problem, MTVU has to remain a “speedhack” instead of adding it as a normal option.
The third big problem with VU1 emulation is the communication with VIF1. The VIF1 if you remember is in charge of sending new micro-programs to VU1 and decompressing data onto VU1 memory. In addition to this it has commands to tell VU1 to start executing, and it has commands to wait on VU1 to finish executing. All this VIF1 - VU communication screams problems for a threaded VU1 design. However we have a clever solution for this too. What we essentially do is duplicate VIF1, we have one instance on the main emulation thread, and we have another instance on the MTVU thread. Whenever VIF1 is told to do something on the main emulation thread, it runs like normal until it reaches a command which interacts with VU1. If the command is to wait on VU1, it just ignores it, because remember pcsx2 already has been used to VU1 running in its entirety, so VU1 is always complete when VIF1 has a command to wait on it; so there’s no need to wait on VU1. If the command is to transfer data to VU1’s memory, things get a tad more complex. What we do is write the command in the MTVU thread’s internal ring buffer. The MTVU thread is essentially just a ring buffer with queued commands which get processed in-order. The VIF1 writes a command in MTVU’s ring buffer to say “upload 1024 bytes of data to VU1 micro-memory”, and then the 1024 bytes of data is also written to the ring buffer. Next VIF1 can send a command that says “execute VU1 with PC address = 0x1000”, and this command gets written to MTVU’s ring buffer as well. Finally MTVU decides it’s time to start executing commands in its ring buffer, and it sees the VIF1 command that writes 1024 bytes to VU1 micro-memory, and it executes this command on the MTVU thread. Then it sees the command to start VU1 executing at address 0x1000, and it starts executing the microVU1 recompiler with PC = 0x1000. This is the basics of how it works, the reason I said we have to duplicate VIF1 between the main emulation thread, and the MTVU thread, is that some VIF1 commands like VIF Unpack rely on status information from VIF1’s registers. To get this working on the MTVU thread we need a separate state of VIF1 registers and status information, because the VIF1’s status on the main emulation thread would be different by the time the commands are executed on the MTVU thread. Essentially from this description, you can see that the MTVU thread not only threads VU1, but also threads VIF1 command too. This is another speedup since VIF1 Unpacks are very computationally expensive (and that’s why we even have a VIF Unpack recompiler).
The remaining problems of VU1 threading are handling the cases where the EE or other processors like VU0 ever need to read back from VU1. This happens very rarely, but in this situation all we need to do is call a “Wait on VU1 Thread” function immediately before trying to read from VU1 memory. If it happened often it would ruin any chance of speedups threading VU1 had, but luckily it rarely happens. The EE actually works very closely with VU0 as opposed to VU1, and so threading VU0 would not be a speedup because the EE would end up reading back from it too much. The good thing is that VU0 is rarely a bottleneck in games; this is evident because you can usually run VU0 interpreters and get a minimal speed-hit (if you try to run vu1 interpreters on the other hand, your speed will usually crawl to ~2fps).
By now you should hopefully have a better understanding of how pcsx2’s VU1 and GS threading basically works, but you may still have questions on why it’s taken so long for us to do it. Well as shown above there were a lot of problems with threading VU1 which needed to be solved, and we weren’t sure about the best ways to handle them. Another huge reason is pcsx2 was not in a good state to make VU1 threading a reality until recently. We didn’t have a centralized GIF Unit like the ps2 does; instead we had code that was all over the place and running the GIF paths without proper scheduling. The new GIF Unit rewrite solved this problem for us. Another problem was that Super VU1, the old VU1 recompiler, was not thread safe (it combines stuff for VU0 and VU1 emulation together), so a new thread safe recompiler was needed, which was one of the goals for the microVU recompiler I wrote. Another problem was the sloppiness of a lot of old pcsx2 code which had needless inter dependencies between various other code modules; we needed to isolate the related code with code refactoring and rewrites to clearly separate the code and make it thread safe. We also needed a thread-safe code emitter which allowed us to run multiple dynamic-recompilers in parallel; we solved this by a code emitter rewrite and using thread local storage for the emitter’s global data. Lastly we needed someone who knew about all these various components and was bored enough to try making something like this work
Some forum members had shown quite an interest in the history of the emulator, so i thought, why not? I'll write a history of the emulator to the best of my knowledge for everybody to look at! Hopefully those who have been here longer (like bositman) can fill you in a bit more on what happened. My apologies for any inaccuracies, I didn't join the team until version 0.8.0 (January 2005)! So, here goes.....
Around the middle of 2001 (Roughly) the PCSX2 project started with just 2 people, Linuzappz and Shadow, who formally coded the PS1 emulator named PCSX. Having finished that project to a point they deemed successful, they decided to embark on a new project which at the time, was completely unheard of, a Playstation 2 emulator, so PCSX2 was born.
They were later followed by other coders such as auMatt, Loser, florin and Saqib, who at the time was known as asadr. This small band of coders with a big dream, little documentation and lackluster hardware (at the time, it was good gear!) managed to forge something together which vaguely simulated a playstation 2, but only to the lengths of running simple homebrew software, which had its doors opened by this small achievement. Not content with the fact they had developed a homebrew emulator, the guys wanted some genuine PS2 software to run, so they picked a few simple games (such as Bust-a-move) and got to work.
Many revisions later and lots of plugin development, they managed to get some games to show loading screens and some even ingame footage, this was a massive achievement for the group, proving to the world that this proof of concept emulator was a reality, showing the world that it was possible.
To show a true emulation image of a system such as the PS2, there is one behemoth which must be conquered to give it that feel which the playstation 2 has, the PS2's own BIOS file. This was an extremely complex, tricky bit of software to emulate. From what i've been told, this took days of solid coding, hacking and debugging and reading of Assembly language to achieve, however when they did eventually get it to run it was extremely distorted, graphically incorrect and extremely slow. Being a feat such as this, it didn't matter how it looked, they'd done it! With the BIOS in place, it allowed the developers to open the doors on PS2 gaming, providing correct BIOS functionality, system configuration and setup provided by the BIOS, some of which is imperative for PS2 games to run.
From this point on, the team spent a lot of time implementing missing parts of the emulator and replacing hacks with correct emulation once those areas were understood, slowly improving the compatability and speed of the emulator, including the implementation of the first Recompiler into PCSX2 (may have been earlier, but it was before me!) which was coded and developed by Goldfinger, this provided a huge leap in speed from the age old Interpreter which is slow by design.
Through time several developers have come and gone. I (Refraction) joined the team around version 0.8.0 (January 2005) after submitting some MFIFO fixes which helped improve the emulation of Final Fantasy X. Later we had Zerofrog join the team, who is responsible for ZeroGS, ZeroSPU2 and the rewrite of the last incarnation of the VU and EE Recompilers which gave us the huge speed boost many will remember in v0.9.1 (June 2006).
During the summer of 2007, GiGaHeRz finally managed to crack one area of the emulator nobody had dare attempt before, he managed to get Netplay to happen! Myself, GiGaHeRz, CKemu, Saqib and Falcon4ever found ourselves logging in to Monster Hunter for a general meet up and drinking session in the virtual world. This was an amazing event for us, it was also very fun talking to other players with the conversation generally going like "Hey, I'm playing this on my PS3

" "That's cool, we are playing this on our PC's using PCSX2" "oooh! that's awesome! can we tag along with you guys?", it was very pant wetting for us and other players alike. Unfortunately, almost by sheer coincidence, many online servers shut down merely weeks after we announced we had netplay, I see no conspiracies, really
By 2008, Zerofrog had left the team to further his real life career at bigger companies and didn't have any more time for the team, myself and asadr (Saqib) were the only remaining developers and with such big pressures upon the two remaining developers, moral and willingness to code the emulator dropped off significantly, leaving myself and saqib just committing small changes to the emulator to keep things afloat, we we're not going to let it die on us now!
After the release of 0.9.4, GSDX went under major improvements from our very own Gabest to improve the overall speed of the popular GS plugin GSDX. He added support for DirectX 10, which solved the issue of clipping surfaces (Due to limitations with DirectX 9) and improved the caching methods of the plugin itself to bring the performance up. Additionally, he rewrote the entire software renderer to be faster overall, but also to allow extra threads, so running the graphics plugin in software mode on an i7, across several threads, showed little difference in performance to hardware mode! The graphics plugin continues to grow in strength with a lot of support from the other team members and the community providing information on current issues to be resolved.
In February 2009, we enlisted the help of a group of enthusiastic coders who had been hosting the project PCSX2 Playground, we recognized the big potential for these guys to be on the team in recognition of the amazing work they were doing. This expanded the team greatly, bringing in developers such as Jake.Stine (Air), arcum, cottonvibes and rama, ever since the emulator has been booming with improvements and advanced the state of the emulator immensely! Since then we have had developers such as pseudonym and gregory.hainaut join the team, both of whom have provided an excellent amount of work in their fields of expertise.
Here we are today, potentially months away from the release of PCSX2 0.9.8, which is showing the highest compatibility the emulator has ever shown, with nearly 59% of games being playable on the emulator (regardless of speed) and a further 22% of games getting to the ingame content, this is a whopping 81% compatibility! which is a huge feat when you look back to how the emulator was in the early days. We are ever closer to releasing what can be consider the near-on perfect Playstation 2 emulator, but before we can get there, many big challenges await us. Only time will tell.......
Of course a mention is needed for the many beta testers and plugin developers who have done their part to improve device emulation and root out the errors and problems for the devs to look into: Bositman, Prafull, CKemu, Falcon4ever, ChaosCode, Nachbrenner (Patch hacker extraordinaire), Crushtest, Neeve (VU/EE floating point accuracy improvements), RPGWizard, Chickenliver (Lilypad), Rebel_X (Twinpad), Luigi__(Megapad) and many many more, we appreciate everything you have done for the emulator, if i forgot your name, it's due to memory laps! we appreciate what you have done too!
As most people probably know, PCSX2 is primarily a dual-thread application. The two main threads are described as such:
- EE/Core thread emulates the PS2's EmotionEngine (including VIF, SIF, GIF, and VUs) and the IOP (including SPU2, CDVD, and PAD)
- GS thread emulates the PS2's Graphic Synthesizer (includes texture swizzling, texture filtering, upscaling, and frame rendering)
Each thread relies on the other thread in some way -- the GS thread cannot swizzle texture data until the EE thread has uploaded said data, for example. Meanwhile, the EE thread cannot upload texture data to the GS thread if the GS thread is currently bogged down rendering last week's frame to video. During these periods, either thread will
sleep, only to be woken up once the other thread has caught up in its workload.
In theory the act of sleeping the EE/GS threads should make benchmarking the CPU load registered by each thread pretty easy: all modern operating systems have built-in APIs for reading the busy/idle time of any thread on the system -- this is the same API used by your tried and true task/process manager, for example:
(Air shows off his personal favorite, ProcessExplorer, part of the SysInternals Suite)
This readout is simple, efficient, and seemingly reliable. It also avoids a lot of the annoying pitfalls one runs into trying to use common alternatives such as
rdtsc and
QueryPerformanceCounter.
... and this is precisely the method I decided to use for PCSX2 0.9.7.r3113 (and still in use as of r3878). Simple theory really: if the GS thread is sleeping a lot (low load) then the game is bottlenecked by EE/Core thread activity. If the EE thread is sleeping a lot and the GS thread reports 90+%, then the GS thread is the bottleneck (a problem often correctable through using lower internal resolutions, for example).
But as I've recently found out, it doesn't work as expected. -_-
It's filled with... threads!
The immediate problem faced by this simple method of load detection is that the latest wave of Windows Vista/7 GPU drivers themselves
are multithreaded. It should have come as little surprise that one of the primary goals of the new DWM/Aero/DX11 systems implemented into Vista/7 is scalable parallel processing that takes better advantage of modern multi-core CPUs. Why this causes the OS built-in thread load detection to fail might be less obvious; I'll explain with an example:
When the GPU driver receives a directive to render the current scene (aka 'Present' in DirectX lingo), it sends the job to a thread dedicated to the task. That thread has a
Present Queue, typically 1 or 2 frames deep, that automatically handles triple buffered vsync'd page updates. If the queue is full when the PCSX2 GS thread issues its next Present request, the
GPU driver will put the GS thread to sleep until a slot in the Present Queue becomes available. End result: The GS thread reports idle time to the operating system (and to PCSX2's GS window), but the GPU is still quite overloaded and bottlenecked via work supplied to it by a different thread altogether.
In essence, it is nearly the same sort of inter-thread dependence that the EE/Core and GS threads have between each other, only now the EE/Core thread's dependency chain extends to include GS
and GPU driver threads (of which there could be one or many).
The solution to this problem is to use a more traditional method of manual load checking: timing various sections of code executed in-thread via either the aforementioned rdtsc (timestamp) or QueryPerformanceCounter, read at key points in the GS thread's execution/program flow. This wasn't such a great idea a few years ago, due to K8/Athlon and P4 generation CPUs lacking a stable internal clock counter. Fortunately, all modern CPUs have a consistent counter suitable for benchmarking, so the pitfalls that have been long associated with using Intel/AMD timestamps are finally obsolete enough to not be a concern for us here.
Over the past two years I have become dearly intimate with Microsoft's Visual C++ 2008 compiler, and the methods it uses for optimizing code. Now generally speaking MSVC 2008 does well -- very well -- especially for everyday "not-so-clever" code. Its global optimization feature (aka Linktime Code Generation, or LTCG) is also a tremendous advantage over GCC -- though GCC is in the process of (finally!) adding LTCG to their own C/C++ compiler. MSVC does have a few very annoying failings as an optimizer, though.
The most glaring of which has to do with templated code and
inlined functions.
Disclaimer: This analysis is for Visual C++ 2008 only. I have not yet analyzed MSVC 2010's code generation. Some of these glitches may be improved or different (or worse even) in 2010. I'll post an update if/when I compile information it in the future.
Edit/Update: This bug only appears to manifest itself when the input parameters are 1st or 2nd generation propagated constants (which is hard to explain if you don't know what that means). So chances of hitting this bug are not actually all that common, but still plausible in many coding scenarios.
Inline Functions
Inline functions are the simpler sort, so I'll cover those first. Here's a simple example of some code that will be optimized away in certain situations.
Code:
static bool g_global = false;
__forceinline void DoSomething( void* dest, size_t size )
{
if (dest && size) memset(dest,0,size);
}
void main()
{
[... code ...]
// dest and size are known constants, so the compiler will inline the above
// function and eliminate all its code -- ie, this line will be effectively ignored.
DoSomething( NULL, 0 );
[... code ...]
}
The problem is that even though the DoSomething() call is effectively
ignored, Visual C++ will still generate code that assumes the function is modifying global memory. Why? Because the compiler's initial analysis of the function doesn't take into consideration the fact that it is being called/inlined with constants as parameters. That means the calling function (
void main() in this case)will have to flush/reload any global variables that may have otherwise been able to remain in registers.
This problem becomes worse the longer a function grows, because every new piece of code int he function can introduce additional optimization dependencies. For example, if a function contains SSE instructions and 128-bit stack operations, it may require mandatory stack-frame alignment,
even if the actual SSE code portions are optimized away.
Templates
For those who do not know, C++ (and C99) has a feature called templating; which is at its core a type-safe and debug-friendly replacement for macros. PCSX2 uses templates extensively to generate function call dispatches for various customizable features of the PS2. A common technique in templates is to use switch statements to simplify code:
Code:
template< uint value > void Dispatch()
{
[.. setup code ..]
switch(value)
{
case 1: [.. do stuff ..] break;
case 2: [.. do stuff ..] break;
case 3: [.. do stuff ..] break;
}
[.. cleanup code..]
}
In the above example, we've created a function that executes one of four possible actions. The only thing that changes between each action is the interior -- all actions share the same basic setup/cleanup code. Instead of using separate functions and/or macros to do four separate instances of the setup and cleanup code, we're able to merge everything into a single template function. The compiler will automatically optimize the function to use
only the selected path. If
'value' is 1, it runs switch case 1. If it is 0, the entire switch is disregarded, etc.
The problem is the same as with the inlined function above: Visual C++'s optimizer bases a lot of its optimization on the
whole function anyway, so dead code that isn't even part of a particular template can adversely impact MSVC's code generation strategy. If only one of the switch cases modifies global memory, any call to any other case will still result in the compiler flushing global registers. Fortunately this particular optimization is minor, and losing it has barely any noticeable impact on performance on modern CPUs.
Sparse Switches and Binary Irony
A second and more serious optimization failure occurs in templated/inlined functions, however; if the function happens to use
sparse switches. A sparse switch is one where the values are not contigious. Example:
Code:
switch(value)
{
case 0x0: if(toggle) { code; } break;
case 0x100: if(toggle) { code; } break;
case 0x101: if(toggle) { code; } break;
case 0x102: if(toggle) { code; } break;
case 0x520: if(toggle) { code; } break;
case 0x521: if(toggle) { code; } break;
case 0x522: if(toggle) { code; } break;
case 0x733: if(toggle) { code; } break;
}
In this example, MSVC's optimizer will employ the use of a
binary search to dispatch the switch. Rather than compare each value individually (8 compares), it will divide the switch into halves or quarters. The resulting optimized code typically finds the right case in two compares, with a worst case of 3-5 compares typically (a vast improvement over an individual linear search, which has a median of 4 compares and worst case of 8 compares). This a great and wonderful optimization and is often times
faster than using function lookup tables.
... but it actually
backfires if the
toggle value is a known constant (such as a template parameter). The optimization method of the switch statement is made by MSVC 2008
before it eliminates unused code. So even if you explicitly assign a
value of 0x101, MSVC 2008 will include its clever binary partition logic! The resulting pseudo-code generated by the MSVC optimizer ends up looking something like this:
Code:
if(value >= 0x520) return;
if(value < 0x100) return;
return; // which is the result of case 0x101 with toggle==false;
The explicit checks for equality are optimized out, as are all unused cases -- just the umbrella binary search logic remains, and all it does is return from the function without doing anything. So what should be a null function ends up having 2 pointless compares; ironically caused by a clever and highly effective optimization strategy in any other normal situation.
For those who don't know, DMA stands for
Direct Memory Access, and it refers to logic circuits in a computer that allow for the automated transfer of system memory to and from
peripherals. DMAs are beneficial because they are simple circuits that do work in parallel to the CPU -- while a DMA transfers data, the CPU is free to do other work.that requires more complex computations and logic. The end result is better utilization of the computer's maximum memory transfer bandwidth and computational/logical ability.
Traditionally DMAs are pretty simple. The Playstation 2's EmotionEngine, however, has an 'intelligent' programmable DMA controller (DMAC). Neatly translated, it means that the DMAC can do a lot more than just move raw data from place to place. It supports several modes of operation and has a number of special features to take advantage of the unique multi-core design of the EE. Furthermore, the EE's DMAC is much more tightly integrated with its memory bus than traditional DMAs, allowing it to transfer data with exceptional efficiency. These two features combined make the EE's DMAC a
key component to PS2 games developers -- in quite a few games, the DMAC actually does
more raw work than the EE Core CPU (R5900).
How The Real Thing Works
While emulating the actual hardware of the DMAC isn't usually needed, it can still be helpful to understand exactly how the PS2's real DMAC works at a hardware level. The EE DMAC operates at 147mhz (1/2th the EE's core clock speed), and transfers 128 bits (16 bytes) of memory per cycle; meaning that the theoretical maximum transfer rate of the DMAC is 2.4 GB/s (147mhz * 16 bytes). It's a nice number, but is technically unattainable even in ideal conditions. Further explanation will make it clear why.
The DMAC connects the PS2's 32 MB of Main Memory (RAM) to various peripheral interfaces, such as VIF (VPU), SIF (IOP), GIF (GS), and IPU (mpeg decoder). VIF, GIF, and IPU are all part of the Emotion Engine and operate at 147mhz, same as the DMAC itself. Thus each of those interfaces can send/receive data at roughly 2.4GB/s. SIF is limited by the IOP's own DMA controller and memry bus, which operates at 1/8th the speed of the EE's DMAC, or about 154MB/s.
Peripheral FIFOs
Each peripheral (VIF, GIF, SIF, IPU, etc) has a 128 or 256 byte FIFO. The FIFO helps mitigate occasional latency differences between Main Memory/SPRAM and the peripheral (some peripherals, in particular the GIF, can incur cycle stalls depending on data sent to them). Thanks to the FIFOs, data can be
burst to/from memory in 128-byte blocks, which helps maximize data transfer rates since the EE's memory bus was built to operate most efficiently in those conditions. However, the maximum bandwidth of Main Memory (32MB) in ideal conditions is only ~1.2GB/s (half of the DMAC), and has additional memory bank related latencies, reducing its effective transfer rates even further. If DMA transfers are only done to/from Main Memory, the DMAC will only be able to come within about 40% of its theoretical maximum throughput.
Enter the Scratchpad!
The Scratchpad (SPRAM) is 16KB of memory integrated directly into the EmotionEngine. Because it is directly integrated on-die, it has no read/write latencies and can
always be accessed at the maximum transfer rate of 2.4gb/s. The integrated nature of the SPRAM means it has to be small in order to fit -- and its lack of size is what limits its usefulness.
So in order to utilize the bandwidth potential of the EE DMAC, a PS2 programmer must find ways to use a combination of Main Memory and Scratchpad transfers in parallel: When main memory stalls due to inherent latencies, the DMAC will automatically busy itself with a pending SPRAM transfer. Likewise, while the DMAC is transferring to/from SPRAM, the EE's Main Memory becomes available to the CPU, which further improves the system's CPU throughput.
The Scratchpad's MemoryFIFO (MFIFO)
The MemoryFIFO function of the EE DMAC performs and managed two simultaneous DMA transfers, as follows:
- Scratchpad -> Main Memory (RAM)
- Main Memory (RAM) -> Peripheral (VIF1 or GIF)
As the buffer in memory is filled by Scratchpad, it is simultaneously drained by the attached peripheral, either VIF1 or GIF. On the surface, the MFIFO can appear to be somewhat silly, since the DMAC already has the ability to transfer direcly from SPRAM -> Peripheral. Adding a stop in Main Memory might seem like a waste of the DMAC's bandwidth capacity, but in some situations the 'extra work' can result in a general improvement in overall transfer speeds.
The PS2 engineers introduced the MFIFO for two reasons:
1. The scratchpad is too small. MFIFO can be used by the EE core as a place to "empty" the scratchpad after its completed a set of data processing. While the data in the MFIFO awaits the DMAC to transfer it, the EE is free to load new raw data into Scratchpad for processing.
2. The GIF has additional bandwidth constraints since it has direct connections to three
PATHs: the the VU1 co-processor (GIF PATH1), VIF1 FIFO (GIF PATH2), and the DMAC's GIF channel (GIF PATH3). When transfers are active on any one of the paths, the other two paths must idle/stall until the current path's transfer completes; meaning that DMAC transfers to both GIF and VIF1 channels can have unexpectedly long stalls.
So by using MFIFO, the EE core can mitigate the unpredictable GIF/VIF1 stalls while it works on entirely new sets of data in parallel. If a GIF transfer via DMA is stalled because of other PATH1 or PATH2 transfers, the DMAC can busy itself with other transfers in meantime, such as SPRAM->memory or memory->SPRAM. These transfers are nearly 'free' in a sense, since the DMAC would have been idle regardless -- but thanks to the MFIFO concept, the SPRAM itself will be free for use by the EE Core to continue processing data. Thus while the DMAC's overall productivity isn't affected, the EE's overall computational ability improves.
I'll talk a bit more on actual emulation details of the PS2's programmable DMA controller in future blogs, so this is
To Be Continued...