
Tuesday, August 09, 2005

Dtrace equivalent for Linux only requires a PhD.

I found the links below thanks to Robert Milkowski: Linux and Solaris

Disclaimer: Yes, I like Linux, and I use Linux on a daily basis. But a lot of work and testing has gone into making the new features in Solaris 10 really impressive and reliable in a production environment.

If you read Linux on POWER versus Solaris 10, Part 1: A technical comparison, you get the idea that what dtrace does is easy to accomplish in Linux. If you dig a little deeper into how the Linux version works, you are in for some surprises.

There are a number of powerful technologies available for Linux on POWER that provide some, if not all, of the features provided by DTrace. In the following, we provide a brief introduction to each of those tools.

They say KProbes is the tool that gives you the ability to add a probe to the kernel, and they even provide a link to a KProbes how-to. Let's take a small task: see how often a syscall is being called on our production server. In Solaris we just need to find the probe we're looking for. Let's say we want to know who is calling mmap (memory mapping a file). I'm doing this the simple way, not even looking up the dtrace arguments that would narrow the search. We use dtrace -l to list the probes, grep for mmap, and find that the first line of output is the one we need.

dtrace -l | grep mmap

195 syscall mmap entry

196 syscall mmap return

375 syscall mmap64 entry

376 syscall mmap64 return

5225 fbt genunix smmap_common entry

5226 fbt genunix smmap_common return

9395 fbt genunix smmaplf32 entry

9396 fbt genunix smmaplf32 return

12543 fbt genunix smmap32 entry

12544 fbt genunix smmap32 return

12545 fbt genunix smmap64 entry

12546 fbt genunix smmap64 return

13902 fbt genunix cdev_mmap entry

13903 fbt genunix cdev_mmap return

14146 fbt genunix ddi_mmap_get_model entry

14147 fbt genunix ddi_mmap_get_model return

24837 fbt cgsix cg6_mmap entry

24838 fbt cgsix cg6_mmap return

28247 fbt mm mmmmap entry

28248 fbt mm mmmmap return

Okay, now that we have the name of the probe, let's write a simple script that says "here i am" every time mmap gets called.

syscall::mmap:entry

{

printf("here i am");

}

The dtrace language is just a C dialect mixed with some awk. It's pretty easy, with no complex kernel coding, so let's run it.

# dtrace -s test.d

dtrace: script 'test.d' matched 1 probe

CPU ID FUNCTION:NAME

0 195 mmap:entry here i am

^C

That is it. We're done. Of course we could do a lot more and get more information with just another line of code, but I'll save that for another day. Dtrace comes with 30,000 probes on a basic install; if you want to watch the kernel, most likely the probe is ready and waiting.

Now let's look at the KProbes solution.

First we have to install a patch on our kernel. Let's hope that we did this before the box went into production, and that our third-party software vendor doesn't have a problem with it. Will IBM's DB2 customer support people be okay with having this patch in the kernel? We can only hope and pray.

$tar -xvzf kprobes-2.6.8-rc1.tar.gz
$cd /usr/src/linux-2.6.8-rc1
$patch -p1 < ../kprobes-2.6.8-rc1-base.patch

Okay, next step:

Writing Kprobes modules

For each probe, you will need to allocate the structure struct kprobe kp; (see include/linux/kprobes.h for more information on this).

Hmm, so we need to write three functions in what looks like some pretty deep C voodoo. Do your production servers have compile tools installed?

/* pre_handler: this is called just before the probed instruction is
  *   executed.
  */
 
int handler_pre(struct kprobe *p, struct pt_regs *regs) {
       printk("pre_handler: p->addr=0x%p, eflags=0x%lx\n",p->addr,
                      regs->eflags);
       return 0;
}
 
 /* post_handler: this is called after the probed instruction is executed
  *   (provided no exception is generated).
  */
 
void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags) {
       printk("post_handler: p->addr=0x%p, eflags=0x%lx \n", p->addr,
                      regs->eflags);
}
 
 /* fault_handler: this is called if an exception is generated for any
  *   instruction within the fault-handler, or when Kprobes
  *   single-steps the probed instruction.
  */
 
int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr) {
       printk("fault_handler:p->addr=0x%p, eflags=0x%lx\n", p->addr,
                      regs->eflags);
       return 0;
}

Excited yet?

Next we have to specify the kernel routine address. We have four choices; let's use the first, since it seems the easiest.

sys_mmap

0000000000425880 t __pci_mmap_make_offset_bus

0000000000425a60 t __pci_mmap_make_offset

0000000000425ba0 t __pci_mmap_set_flags

0000000000425bc0 t __pci_mmap_set_pgprot

0000000000425be0 T pci_mmap_page_range

000000000042d960 T sys32_mmap

000000000042da00 T sys32_mmap2

00000000004402c0 T sunos_mmap

0000000000459ca0 T do_mmap_pgoff

000000000045ada0 T build_mmap_rb

000000000045ae00 T exit_mmap

000000000045e1e0 T generic_file_mmap

000000000046a0a0 t shmem_mmap

0000000000476740 t exec_mmap

00000000004c65a0 t nfs_file_mmap

000000000051d580 t shm_mmap

00000000005259a0 t mmap_mem

0000000000525fe0 t mmap_zero

0000000000526260 t mmap_kmem

00000000005b9720 t proc_bus_pci_mmap

00000000005cc000 t fb_mmap

00000000005d61e0 t sbusfb_mmapsize

00000000005d6220 t sbusfb_mmap

00000000005da4a0 t atyfb_mmap

00000000005fd080 t sock_mmap

00000000005ffc40 T sock_no_mmap

0000000000661a20 t packet_mmap

0000000000718db8 d ffb_mmap_map

0000000000719168 d cg6_mmap_map

000000000071f7a0 d packet_mmap_ops

000000000072a4f0 R __ksymtab_do_mmap_pgoff

000000000072adb0 R __ksymtab_generic_file_mmap

000000000072f3f0 R __ksymtab_sock_no_mmap

0000000000731f60 R __kstrtab_do_mmap_pgoff

0000000000732e10 R __kstrtab_generic_file_mmap

000000000073b2e8 R __kstrtab_sock_no_mmap

phoenix:/boot#

Okay, we now have the address. It looks like it's

000000000042d960 T sys32_mmap
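
Picking one address out of a listing like the one above is easy to get wrong by eye, because grep matches every symbol that merely contains "mmap". Here is a minimal user-space sketch of an exact-name lookup over System.map-style lines. It assumes each line has the form "address type name"; lookup_symbol is a hypothetical helper I made up for illustration, not part of KProbes or the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find one symbol by exact name in System.map-style
 * text, where each line is "<hex address> <type letter> <symbol name>".
 * Returns 1 and stores the address on an exact match, 0 if not found.
 */
int lookup_symbol(const char *map_text, const char *name,
                  unsigned long long *addr_out)
{
    assert(map_text != NULL && name != NULL && addr_out != NULL);

    const char *line = map_text;
    while (*line != '\0') {
        size_t len = strcspn(line, "\n");   /* length of the current line */
        char buf[256];

        /* Copy the line so sscanf cannot read past its end.
         * Lines too long for the buffer are skipped. */
        if (len < sizeof buf) {
            unsigned long long addr;
            char type;
            char sym[128];

            memcpy(buf, line, len);
            buf[len] = '\0';

            /* Exact comparison: "sys32_mmap" must not match "sys32_mmap2". */
            if (sscanf(buf, "%llx %c %127s", &addr, &type, sym) == 3 &&
                strcmp(sym, name) == 0) {
                *addr_out = addr;
                return 1;
            }
        }

        line += len;
        if (*line == '\n')
            line++;                         /* step over the newline */
    }
    return 0;
}
```

Fed the listing above, a lookup for sys32_mmap would return 0x42d960, while a lookup for sys_mmap would fail, since no symbol with exactly that name appears in the output.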

Well, now it's time to write some more code.

/* specify pre_handler address
  */
       kp.pre_handler=handler_pre;
 /* specify post_handler address
  */
       kp.post_handler=handler_post;
 /* specify fault_handler address
  */
       kp.fault_handler=handler_fault;
 /* specify the address/offset where you want to insert probe.
  * You can get the address using one of the methods described above.
  */
       kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name("do_fork");
 
 /* check if the kallsyms_lookup_name() returned the correct value.
  */
       if (kp.addr == NULL) {
                      printk("kallsyms_lookup_name could not find address "
                             "for the specified symbol name\n");
                      return 1;
       }
 
 /*   or specify address directly.
  * $grep "do_fork" /usr/src/linux/System.map
  * or
  * $cat /proc/kallsyms |grep do_fork
  * or
  * $nm vmlinuz |grep do_fork
  */
       kp.addr = (kprobe_opcode_t *) 0xc01441d0;
 
 /* All set to register with Kprobes
  */
        register_kprobe(&kp);

No, we're not done yet. The next step is to add our printk code into the kernel function.

Getting the offset

You can insert printk's at the beginning of a routine or at any offset in the function (the offset must be at the instruction boundary). The following code samples show how to calculate the offset. First, disassemble the machine instructions from the object file and save them as a file:

$objdump -D /usr/src/linux/kernel/fork.o > fork.dis

Well, that produces a nice object dump. You know assembly language, right?

Now we have yet more details to look into:

To insert the probe at offset 0x22c4, get the relative offset from the beginning of the routine 0x22c4 - 0x22b0 = 0x14 and then add the offset to the address of do_fork 0xc01441d0 + 0x14. (To ascertain the address of do_fork, run $cat /proc/kallsyms | grep do_fork.)

You can also add the relative offset of do_fork 0x22c4 - 0x22b0 = 0x14 to the output of kallsyms_lookup_name("do_fork"); Thus: 0x14 + kallsyms_lookup_name("do_fork");
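
To make the arithmetic in that passage concrete, here is a minimal user-space sketch in C. It assumes the numbers quoted above (routine start 0x22b0 and target instruction 0x22c4 in the objdump listing, do_fork at 0xc01441d0 in the running kernel); probe_address is a hypothetical helper for illustration only, not a KProbes function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the address to probe.
 *   func_start_in_dump  - where the routine begins in the objdump listing
 *   insn_in_dump        - where the target instruction sits in that listing
 *   func_addr_in_kernel - the routine's address in the running kernel
 */
uint64_t probe_address(uint64_t func_start_in_dump,
                       uint64_t insn_in_dump,
                       uint64_t func_addr_in_kernel)
{
    /* The target instruction must not come before the routine start. */
    assert(insn_in_dump >= func_start_in_dump);

    /* Relative offset of the instruction inside the routine. */
    uint64_t offset = insn_in_dump - func_start_in_dump;

    /* The same offset applied to the routine's live kernel address. */
    return func_addr_in_kernel + offset;
}
```

With the quoted values, the offset is 0x14 and the resulting probe address is 0xc01441e4. The sketch does not check that the offset lands on an instruction boundary; you still have to read the disassembly for that.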

Okay, now that we have done all that, we are able to start our new probe.

Enabling the magic SysRq key

We already compiled in support for the SysRq key. Enable it with:

$echo 1 > /proc/sys/kernel/sysrq

Now you can use Alt+SysRq+W to view all inserted kernel probes on the console, or in /var/log/messages.

Well, the dtrace example and the introduction to this post ended on page 2. In the word processor I'm using, I just crossed page 6. KProbes are simple, right? In case you are wondering, I was pasting the examples from the document because I'm not a kernel hacker, and I don't have four hours to learn enough of their code. For those of you who say I picked an overly simple example for dtrace, let's do their example in dtrace so you can compare.

dtrace -l | grep fork | grep syscall
9 syscall forkall entry
10 syscall forkall return
203 syscall vfork entry
204 syscall vfork return
245 syscall fork1 entry
246 syscall fork1 return
#

We have the list of probes. It should be fork1, so let's write some code.

# cat test2.d

syscall::fork1:entry

{

printf("tid: %d, pid: %d, execname: %s\n", tid, pid, execname);

}

Okay, here is the output of the above code. It gives you even more info than the KProbes version.

dtrace -s test2.d

dtrace: script 'test2.d' matched 1 probe

CPU ID FUNCTION:NAME

0 245 fork1:entry tid: 1, pid: 1998, execname: sh

1 245 fork1:entry tid: 1, pid: 1998, execname: sh

Well, to wrap things up: apparently KProbes is made for the kernel programmer, and dtrace is made for the sysadmin. Most sysadmins know a little C, sed, and awk. So on one side we have two simple dtrace scripts, and on the other we have five pages of fairly complex C kernel code that does less. Which do you want to use in production?


8 Comments:

Blogger McBofh said...

It looks to me as if enabling KProbe usage requires more effort than adding a statically defined D probe to your driver or kernel module.

Why would a sane person bother?

Oh, I know, to "prove" that because you can do it on Linux, therefore Linux is just as good as Solaris.

Bzzzzt.

8:53 PM  
Blogger PerformanceGuru said...

You should look at Systemtap if you are looking for a true dtrace equivalent - http://sourceware.org/systemtap

12:59 AM  
Blogger Boyd Adamson said...

performanceguru: I may have misunderstood what I read at http://sourceware.org/systemtap/runtime/start_page.html but to say that that's a dtrace equivalent implies to me that you haven't seen dtrace.

2:29 AM  
Blogger jamesd_wi said...

well i have taken a look at systemtap, see my comments at

http://uadmin.blogspot.com/2005/08/systemtap-alpha.html

also it would be nice to see the output of even a simple script running on a live system done using systemtap along with its script that generated the code.

10:27 AM  
Blogger Ethan Anderson said...

Just how hard would it be to port dTrace to Linux? Is this a licensing issue or a technical one?

2:22 AM  
Blogger crisp said...

its not that difficult to port dtrace to linux; i have a near working port (see www.crisp.demon.co.uk/tools.html for the source code).
its a part time thing for me, and after about 6 weeks - its coming together. see my blog on the web site for odd progress tips.

3:23 PM  
