Please have a look at the corrected primitive benchmark numbers:
Number of iterations : 50
asize,bsize,rsize = 30
csize = 100
Number of calls made to hpf_server = 21981
 
                           Instrumented code    Non-Instrumented (stand-alone)
Total elapsed time (sec)   8.54                 7.79
hpf_server                 0.38                 x
register_val               0.02                 x

hpf_server       : total time taken by hpf_server function calls
register_val     : total time taken by register_var function calls during registration of variables

Now we are close to real benchmark values. I ran several tests, and the results are approximately the same.
This time I ran the manager, the Instrumentation server, and the actual F90 executable of the instrumented version on
a different processor. Previously all three were running on the same processor as the HTTP server.
Surprisingly, we have very little overhead, as you can see (8.54 sec instrumented versus 7.79 sec stand-alone).
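
To put a number on the overhead, here is a small sketch that only redoes the arithmetic on the figures reported above. The variable and function names are my own, chosen for illustration. It assumes the hpf_server and register_val times are in seconds, like the total elapsed time.

```python
# Figures copied from the benchmark numbers above.
# Assumption: all times are in seconds.
INSTRUMENTED_TOTAL = 8.54   # total elapsed time, instrumented code
STANDALONE_TOTAL = 7.79     # total elapsed time, non-instrumented code
HPF_SERVER_TIME = 0.38      # total time spent in hpf_server calls
REGISTER_TIME = 0.02        # total time spent registering variables
HPF_SERVER_CALLS = 21981    # number of calls made to hpf_server


def overhead_summary():
    """Return derived overhead figures as a dict (names are illustrative)."""
    overhead_sec = INSTRUMENTED_TOTAL - STANDALONE_TOTAL
    return {
        # Extra wall-clock time caused by instrumentation.
        "overhead_sec": overhead_sec,
        # Overhead relative to the stand-alone run, in percent.
        "overhead_pct": 100.0 * overhead_sec / STANDALONE_TOTAL,
        # Average cost of one hpf_server call, in microseconds.
        "per_call_us": 1e6 * HPF_SERVER_TIME / HPF_SERVER_CALLS,
        # Overhead not explained by the two measured functions.
        "unaccounted_sec": overhead_sec - HPF_SERVER_TIME - REGISTER_TIME,
    }


if __name__ == "__main__":
    for name, value in overhead_summary().items():
        print(f"{name:16s} {value:8.2f}")
```

With these inputs the overhead comes to about 0.75 sec, or roughly 9.6% of the stand-alone time, and one hpf_server call costs about 17 microseconds on average. About 0.35 sec of the difference is not covered by the two measured functions.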