From: bmc@kiowa.eng.sun.com (Bryan Cantrill) Subject: Re: But why? Date: 1996/10/29 Message-ID: <5543dq$qjd@engnews2.Eng.Sun.COM> references: <54990q$n5e@caip.rutgers.edu> organization: Brown University keywords: none newsgroups: comp.sys.sun.hardware In article <54990q$n5e@caip.rutgers.edu>, David S. Millerwrote: >bacon@mtu.edu (Jeff Bacon) writes: > >Since I have been able to find an intelligent posting in this thread, >I will respond to it and explain what I can as chief architect of the >SparcLinux port. > >> of course, then, the obvious question that comes up is WHY is it that >> solaris has such higher overhead costs in doing things? >> >> obviously there's more code to work through to do any given thing. >> someone must have thought it necessary. but...why? >> >> obviously it's got lots of extra crud from SVR4. why not pitch it? >> > >The answer to this is pretty straight forward actually. The main >points of interest are: > >1) Solaris's networking stack, in all of it's incantations (one breed > of it was the Lochman code in 2.0, 2.1 and early 2.2 releases, then > it was rewritten by another company for 2.3 onward) is SVR4 streams > based. The performance penalty, even with lots of tricks, for > using a SVR4 streams networking architecture is well known. > Someone who happens to have a 2.2 Solaris CD around, or even a 2.3 > Solaris CD, should install that thing and run lmbench on it to see > what "pure Streams based networking" without the tricks can really > do. > > Linux on the other hand has a "no bullshit" networking architecture > that is not streams based. Yet we also take advantage of the many > known networking performance enhancements that exist in the > research realm (ie. copy/checksum, the van jacobson hacks, etc.) > >2) Linux is light weight, Solaris is a pig. > > One of the most critical things that contributes to performance > is cache/tlb footprint of the operating system. Linux being small > (yet still provide a full POSIX unix environment!) solves the cache > footprint problem in a big way. I've solved the TLB footprint > using Linux's small size and a Sparc specific trick. > > The MMU's present on the sun4m/sun4d line of Sun machines possess a > three level page table scheme. Using this, one has the capability > to use the normal 4k sizes pages, and also larger 256k and 16MB > sized pages. The average TLB on these machines has 32 or 64 > entries to cache these pte's, if the entry is not in the TLB > hardware has to go out to the memory bus and walk the software page > tables to "reload" the TLB so that the translation can be > satisfied. > > This "miss processing" is very expensive. Under SunOS and Solaris, > they do not take advantage of the 16MB and 256k sized pages to map > the operating system. Therefore those two systems take many misses > in the TLB during even the most rudimentary trap into the kernel. > However under Linux the TLB misses for the OS are quite minimal. > In fact I will gave an example: > > On your average SparcClassic with a 32 entry TLB, consider such > a machine with 24MB of memory installed. Under Linux I can map > the entire operating system (sans IO device register mappings > and Lance Ethernet DMA) in 3 (count 'em, 3!) TLB entries. These > 3 entries are enough to allow the kernel to access an arbitrary > physical page from kernel space. > > Under Solaris, the OS would need 3 + (24MB / 256K) + (24MB / 4K) > TLB entries to map this same amount of space. For a great many > number of operations, it is quite easy for an OS with this page > table strategy to blow the entire user context out of the > hardware TLB. Which in turn means more processor stalls (in > fact many) for both the user level processes and the operating > system. > > Result? Severe degradation in performance for the latter > scheme. > >3) Every BSD and SVR4 based system today, except for Linux, has a very > broken System call mechanism. > > You'd think that when people put together function call conventions > for a particular processor, the OS people would take a look at this > and find a way to take advantage of this. In fact, believe it or > not, they have not to this very day. > > Linux from day one, takes advantage of the procedure call > conventions on a particular architecture so that it can process > system calls in the most expediant way possible. I will give > an example on the Sparc to prove this: > > Consider your average 3 argument system call. The user level > code does something like this: > > mov %arg0, %o0 > mov %arg1, %o1 > mov %arg2, %o3 > mov SYSTEM_CALL_NUMBER, %g1 > t SYSCALL_TRAP > > At this point control reaches the operating system, it must > prepare to handle this request from the user. On the Sparc, this > is either a two step or a three step process depending upon > whether you are doing it in the traditional broken UNIX way or the > clean, fast, and superior Linux way. First I will show the Linux > method for doing this: > > 1) Step one, jump onto the kernel stack for this task > and make sure the kernel has a register window to > operate in safely. > > For Linux the code path for this runs at ~18 instructions > for the common case (the kernel already has a valid > register to use so now saving needs to be done). It runs > at ~42 instructions for the second most common case (the > kernels needs to allocate a new register window and the > user has a valid stack pointer) and ~82 instructions for > the least common case (kernel needs a window, user has > an invalid stack pointer, thus the kernel needs to save > the user's window into a special per-task save area). > > 2) Take the system call number, check for valid value, use > this to offset into a table of system call function ptrs. > Move arguments into place and perform the syscall. > > Basically this is a simple operations a looks something > like: > > sll %g1, 2, %l4 ! produce offset > ld [%l7 + %g1], %l7 ! syscall ptr base was in %l7 > SAVE_ALL ! perform step #1 above > mov %i0, %o0 > mov %i1, %o1 > mov %i2, %o2 > mov %i3, %o3 > mov %i4, %o4 > jmp %l7, %o7 > mov %i5, %o5 > > That is it, that is the entire system call under Linux. > > Under Solaris/SunOS things are wildly different. Step one is > basically the same, but step 2 is disgustingly inefficient for > those systems. Basically they do: > > 2) Call common system_call() C function. > > 3) This routine allocates a "system call argument package" > structure on the kernel stack. This is wasteful because > we already have all of this information in registers or > in guarenteed save areas. > > 4) Then this routine determines the function to call, and > passed this "package" of arguments to the routine. > > 5) Every system call which expects arguments then must > "unpack" this structure to get at the copy of the arguments > again highly inefficient. > > For every system call the system performs, you eat this unnecessary > overhead under SunOS/Solaris, under Linux only the bare minimum is > performed to do the system call successfully. > >4) Solaris cannot even do it's own optimizations correctly because > SunPRO is a broken compiler. > > I won't make such a statement without explaining this with real > facts, here goes. > > A neat part of the Sparc ABI is that it leaves you with a few > processor registers that the C compiler is not allowed to use > in the code it produces. Two of which are "%g6" and "%g7". > A problem in unix kernels is that you are constantly accessing > the current tasks control structure ('proc' and 'uarea' on > traditional UNIX's, the 'task_struct" under Linux). Hey, why > not put these pieces of information in those "extra" registers > and avoid the address computation all the time? Yes, very > brilliant idea. > > Under Solaris the trap entry code places the uarea and proc ptr > in %g6 and %g7. Under Linux the trap entry code places the > current processes task_struct in %g6. Now here comes where the > implementations differ. > > Under Solaris all of the so called "locore" (basically all the > gook which has to be written in raw assembly) code can directly > take advantage of this. However, the C code cannot do this > because SunPRO lacks a way for you to tell the compiler that > "hey you don't need to load things, it's already in these > hard coded registers" So they have the C code call these little > assembly stubs to get the values: > > get_uarea: > retl > mov %g6, %o0 > > get_proc: > retl > mov %g7, %o0 > > That is gross, why even do the optimization in the first place? > > Now GCC has a way to fully take advantage of such an optimization, > basically all I have to do is put the following in a header file. > > register struct task_struct *current asm("g6"); > > Tada, now GCC will fully understand what I have done for it. > Under SparcLinux this optimization alone took away 115 instructions > in the scheduler sources, and it took ~50 instructions out of the > exit() handling, and it took ~65 instructions out of the fork() > handling. > >So my question always is, in matters such as these. Who are these >processor cycles for anyways, the kernel or the user? Think about >this when you consider how much overhead is being saved from one OS to >another, and to what scale this is occurring. > >I hope that explains some of it, and gives people at least some sort >of idea of the kinds of things that makes Linux scream on just about >any hardware. If people would like more explainations like the above, >I'd be more than happy to chat with you via email about it or >similar. I love talking about performance issues on various >processors and systems. > >Oh, and one thing that has not been mentioned yet in this thread (and >yes NetBSD/OpenBSD both have this as well, good work guys). That >SparcLinux kernel that gets all of this incredible performance runs on >both sun4c and sun4m machines. Sun Engineers way back when scratched >their heads for months and couldn't figure out a way to pull it off >(you need a seperate kernel image depending upon whether you are >running on a sun4m or a sun4c, for SunOS/Solaris). And on top of that >Linux obviously pulls it off efficiently. > >One final note. When you have to deal with SunSOFT to report a bug, >how "important" do you have (ie. Fortune 500?) to be and how big of a >customer do you have to be (multi million dollar purchases?) to get >direct access to Sun's Engineers at Sun Quentin? With Linux, all you >have to do is send me or one of the other SparcLinux hackers an email >and we will attend to your bug in due time. We have too much pride in >our system to ignore you and not fix the bug. > >David S. Miller >davem@caip.rutgers.edu > Have you ever kissed a girl? - Bryan ---------------------------------------------------------------------- Bryan Cantrill, Solaris Performance. bmc@eng.sun.com (415) 786-3652