profiler which does not just sum up used time


Willy Denoyette [MVP]

Ben Voigt said:
This is because there's no such thing as a "driver thread" or "non-driver
thread". There are only threads executing user mode code, and threads
executing kernel mode code (be it from a 3rd party driver, bundled driver,
or core kernel).

I know that there are only threads, and that they run in either user mode or
kernel mode. What I'm trying to explain is that kernel code can run at
real-time priority levels without needing to execute driver code; such code
can even run at IRQL 1 or 2.

Staying at elevated IRQL is an error, so that can't be considered to
provide realtime processing.

I never suggested doing so; that would render the system unusable.
Only hardware interrupt handlers are realtime.

Yep, that's why I said from the beginning that Windows is not a real-time
OS.

Willy.
 

Joe Withawk

Taking the dubious claim of "several thousand frames a second" at face

Ok. 2400. It is more than two so several is not wrong really :)
So what if you lose a frame? Even at 60fps (the maximum rate for video
synchronized to a typical LCD monitor), losing a single frame now and then
isn't going to be noticeable at all by a human being.

The graphics are rendered interlaced and fed into a hardware box which turns
them into a PAL TV signal. If I lose one or more frames, I don't know how many,
and the odd/even scanlines can get screwed up.
There's not a single human user in the world who would notice the
occasional lost frame (or even dozens of frames) at a frame rate of
"several thousand frames a second".

I do not render at that rate; I sync to the screen. I am only saying that the
rendering itself is fast. If I do not sync, I can render very fast; when I
sync, I render at 60 fps. The point being: it is not because my rendering is
costly that I am missing frames.
Given that a thread quantum on Windows is around 50 ms, I'd say you're
pretty lucky you're _only_ getting blocked for 3 [Joe: it is 50, read above]

Interesting that that is an ordinary Windows thread timeslice. I did not know
the value.
What hardware? What buffer? Why is it so small that 50ms allows it to
underflow? How is this connected to the question of rendering to the
screen? What are you actually trying to do here?

I have a box which needs me to give it exactly 50 frames a second. I
currently test with just an LCD display at 60 Hz, but in the end it is 50 Hz
for the box.
The frames are really half frames with interlaced graphics. It expects to
get even scanlines, then odd, then even. If I render even, miss a frame (here
a field) and then render odd, then I am in trouble, since NOW the box
expected the even lines again:
me: even, miss, odd, even
it:  even, odd,  even, odd

I do not even know for sure how many I missed, which I would need to know to
get back on the horse by skipping a field (half frame).
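The idea would be something like this (only a sketch; WaitForVBlank and
SendField are placeholders, not the box's actual API): derive the field
parity from elapsed time rather than from a frame counter, so that a missed
field cannot flip even/odd.

using System;
using System.Diagnostics;

class FieldSync
{
    const double FieldRate = 50.0;                  // PAL: 50 fields per second
    static readonly double FieldPeriodMs = 1000.0 / FieldRate;

    static void Main()
    {
        Stopwatch clock = Stopwatch.StartNew();

        while (true)
        {
            WaitForVBlank();                        // placeholder: block until the next field slot

            // Which field slot are we in, counted from the start of the stream?
            long slot = (long)(clock.Elapsed.TotalMilliseconds / FieldPeriodMs);
            bool evenField = (slot % 2) == 0;       // parity follows wall-clock time,
                                                    // so a missed slot does not shift it

            SendField(evenField);                   // placeholder: render/emit the matching field
        }
    }

    static void WaitForVBlank() { /* placeholder */ }
    static void SendField(bool even) { /* placeholder */ }
}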
You can do things with the threads as suggested by Willy and Ben to try to
hog the processor all to yourself, but at that point you've sacrificed any
semblance of a stable Windows installation. Windows wasn't built to be
used that way, and doing so is almost certainly going to lead to trouble.
If you need to use a computer in that way, you need a different OS.

If my code runs on one core all the time and the rest of the system runs on
other cores, then I am happy. I do not care about other user applications
since only mine will run on the entire system. It is me and the system so to
speak.
Yes, that's what I'm saying. Windows is actually not that bad at
scheduling threads, if you let it do its job. Generally speaking, threads
that never become unrunnable and which can use as much CPU time as you can
give it are actually much better off running at _lower_ priority rather
than higher.

Well... I tried being nice and letting the thread and the app run at normal
priorities, and the result was that I was preempted for too long. I cannot
tolerate a delay of 50+ ms. If the graphics card had a buffer larger than 3
frames, then it could have lived with a delay of up to 17 ms * number of
buffers and been happy.
 

Joe Withawk

Just wondering how you measured these timings and what code you are actually
running when measuring.
As said in another reply, the real-time priority class will put all threads
in the process in the real-time range (at level 24); a couple of system
threads may be running at level 31, and the system can even run threads at an
interrupt request level that prevents other threads from running. As I said
before, Windows is not a real-time OS: you have no control over the thread
scheduler other than increasing the relative priority level of one user
thread over another, and you won't be able to stop the scheduler from
pre-empting your thread. So you can't achieve what you want on Windows
(without the risk of rendering the system unusable), that is, running a
thread on a single core without being interrupted or pre-empted; you can't
even prevent being re-scheduled onto another core.

I allocate an array. Then, at the start of my loop, I look at the system
counter, do my work, add the elapsed time to the array, and increase an
index.
When I reach n samples I write them to a file and look at them.
What I see is that my loop takes around 16 ms (running at 60 Hz). Some
iterations take a little longer and some a little shorter, due to the buffer
being there. Then suddenly one takes much, much longer, up to over 100 ms,
and here I will lose frames, and then it resumes at around 16.
 

Barry Kelly

Joe said:
I have a problem with an application which runs smoothly but once every few
seconds shudders.
Having used the .NET Stopwatch class I now know that a certain loop
generally runs very quickly, but sometimes it takes ten times as long.
The loop is not trivial, so the easiest way for me to find the reason for the
delay would be to profile it in such a way that I can see the time each call
to it takes, and not just the accumulated time.

I'm interested in your problem. I've modified the code you posted to try
and reproduce the spikes you saw, but I can't.

I can think of two obvious possible culprits:

* GC pauses
- I don't know how much garbage you're generating in your application,
but depending on its size and nature (e.g. if you're not recycling large
objects living on the large object heap), then you'll end up hitting
expensive gen2 collections too often. What kind of percentage GC are you
seeing in existing profilers, and how does it compare to what you
calculate your total 'long pauses' to be as a fraction of total time?

* Badly behaved drivers
- It's possible that you've got a badly implemented network or sound
card driver (or others) being activated by background tasks. Does the
same problem occur on other machines with different hardware?
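
For what it's worth, a quick way to test the GC theory is to record
GC.CollectionCount(2) around each timed iteration and see whether the slow
ones coincide with gen2 collections. A sketch, with DoWork standing in for
the real loop body:

using System;
using System.Diagnostics;

class GcSpikeCheck
{
    // Placeholder for the real ~16 ms loop body.
    static void DoWork()
    {
        double y = 0;
        for (int i = 0; i < 1000000; i++)
            y += Math.Sqrt(i);
    }

    static void Main()
    {
        Stopwatch sw = new Stopwatch();
        for (int n = 0; n < 2000; n++)
        {
            int gen2Before = GC.CollectionCount(2);   // gen2 collections so far

            sw.Reset();
            sw.Start();
            DoWork();
            sw.Stop();

            int gen2After = GC.CollectionCount(2);
            double ms = sw.ElapsedTicks / (double)Stopwatch.Frequency * 1000;

            // If the slow iterations are exactly the ones where the count changed,
            // the spikes are gen2 pauses rather than scheduling.
            if (gen2After != gen2Before)
                Console.WriteLine("gen2 GC during iteration {0}: {1:F3} ms", n, ms);
        }
    }
}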

Here's the code (C# 3.0) I was working with, trying to reproduce your
data at http://www.greenleaf.dk/time.jpg :

I tweaked the inner loop to run in about 16.6 ms on my machine.

---8<---
using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;
using System.Diagnostics;

class App
{
    static double Variance(IEnumerable<double> values)
    {
        double average = values.Average(x => x);
        double result = 0;
        foreach (double value in values)
        {
            double var = average - value;
            result += var * var;
        }
        return result;
    }

    static double StandardDeviation(IEnumerable<double> values)
    {
        return Variance(values) / values.Count();
    }

    static void Main()
    {
        Stopwatch sw = new Stopwatch();
        double[] times = new double[2000];
        int c = 0;
        while (true)
        {
            sw.Reset();
            sw.Start();
            double y = 0;
            for (int i = 0; i < 50000000; i++)
                y = Math.Sin(Math.Cos(Math.Sqrt(i)));
            sw.Stop();
            times[c++] = sw.ElapsedTicks
                / (double) Stopwatch.Frequency * 1000;
            if (c == times.Length)
            {
                using (StreamWriter w = File.CreateText("times.txt"))
                {
                    foreach (var time in times)
                        w.WriteLine(time);
                    w.WriteLine();

                    var average = times.Average(t => t);
                    w.WriteLine("Average: {0:F3}ms", average);
                    w.WriteLine("Variance: {0:F3}ms", Variance(times));
                    w.WriteLine("StdDev: {0:F5}ms",
                        StandardDeviation(times));

                    foreach (var result in
                        from index in Enumerable.Range(0, times.Length)
                        let time = times[index]
                        let diff = (time - average) / average
                        where Math.Abs(diff) > 0.02 // 2%
                        select new { Index = index, Time = time,
                                     Difference = diff })
                    {
                        w.WriteLine(
                            "Perturbed: {0:F3}ms at {1}, " +
                            "difference {2:F3}%",
                            result.Time, result.Index,
                            result.Difference);
                    }
                }
                break;
            }
        }
    }
}

--->8---

Compiled using:

$ csc -optimize -debug- app.cs

I'm running it in a Cygwin Bash shell, using the nice command to elevate
its priority:

$ nice -n -20 ./app.exe

I get these tail results from the output, 'times.txt':

---8<---
Average: 16.666ms
Variance: 4.355ms
StdDev: 0.00218ms
Perturbed: 17.013ms at 379, difference 0.021%
Perturbed: 17.055ms at 439, difference 0.023%
Perturbed: 17.105ms at 499, difference 0.026%
Perturbed: 17.033ms at 559, difference 0.022%
Perturbed: 17.216ms at 838, difference 0.033%
Perturbed: 17.031ms at 979, difference 0.022%
Perturbed: 17.003ms at 1399, difference 0.020%
--->8---

These differences are minor, not high enough to be your 10x inflation.

-- Barry
 

Chris Nahr

I have a problem with an application which runs smoothly but once every few
seconds shudders.

Sounds like garbage collection to me. Are you allocating and
releasing lots of objects?
 

Barry Kelly

Barry said:
static double Variance(IEnumerable<double> values)
{
    double average = values.Average(x => x);
    double result = 0;
    foreach (double value in values)
    {
        double var = average - value;
        result += var * var;
    }
    return result;
}

static double StandardDeviation(IEnumerable<double> values)
{
    return Variance(values) / values.Count();
}

Above code is bad though :) Should do division by Count() in Variance,
and do a Sqrt in StandardDeviation.
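I.e., roughly (a corrected sketch of the two helpers, as drop-in replacements
for the ones in the App class above; same usings as before):

static double Variance(IEnumerable<double> values)
{
    double average = values.Average(x => x);
    double result = 0;
    foreach (double value in values)
    {
        double var = average - value;
        result += var * var;
    }
    // divide by the count here...
    return result / values.Count();
}

static double StandardDeviation(IEnumerable<double> values)
{
    // ...and take the square root here
    return Math.Sqrt(Variance(values));
}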

-- Barry
 

Willy Denoyette [MVP]

Barry Kelly said:
I'm interested in your problem. I've modified the code you posted to try
and reproduce the spikes you saw, but I can't.
[...]
I tweaked the inner loop to run in about 16.6 ms on my machine.

            double y = 0;
            for (int i = 0; i < 50000000; i++)
                y = Math.Sin(Math.Cos(Math.Sqrt(i)));
            sw.Stop();
            times[c++] = sw.ElapsedTicks
                / (double) Stopwatch.Frequency * 1000;
[...]
Average: 16.666ms
Variance: 4.355ms
StdDev: 0.00218ms
[...]
These differences are minor, not high enough to be your 10x inflation.

-- Barry



You don't believe you are executing 50000000 *
Math.Sin(Math.Cos(Math.Sqrt(i)));
per 16.6 mseconds, do you ;-)?

What is effectively done here is incrementing i 50000000 times; the Math
operation is hoisted, as the results are never used outside the loop.

Willy.
 

Willy Denoyette [MVP]

Joe Withawk said:
I allocate an array. Then, at the start of my loop, I look at the system
counter, do my work, add the elapsed time to the array and increase an
index.
When I reach n samples I write them to a file and look at them.
What I see is that my loop takes around 16 ms (running at 60 Hz). Some
take a little longer and some a little shorter due to the buffer being
there. Then suddenly it takes much, much longer, up to over 100 ms, and here
I will lose frames, and then it resumes at around 16.



Yep, this describes the code you have posted before, which doesn't compile;
I would like to see your complete working code.

Willy.
 

Willy Denoyette [MVP]

Joe Withawk said:
I allocate an array. Then, at the start of my loop, I look at the system
counter, do my work, add the elapsed time to the array and increase an
index.
When I reach n samples I write them to a file and look at them.
What I see is that my loop takes around 16 ms (running at 60 Hz). Some
take a little longer and some a little shorter due to the buffer being
there. Then suddenly it takes much, much longer, up to over 100 ms, and here
I will lose frames, and then it resumes at around 16.


Please post your exact code, there must be something wrong with it.

Not sure what OS you are running, but if it happens to be Vista or WS2008,
you could take advantage of the Multimedia Class Scheduler Service (MMCSS).
When you register a thread with this service you are guaranteed to get
scheduled at a fixed rate (10 msec by default for all pre-defined tasks).
The API declarations and a usage sample follow.

// APIs (Vista or WS2008 only)
[DllImport("Avrt.dll"), SuppressUnmanagedCodeSecurity]
extern static IntPtr AvSetMmThreadCharacteristics(string task, ref int index);
[DllImport("Avrt.dll"), SuppressUnmanagedCodeSecurity]
[return: MarshalAs(UnmanagedType.Bool)]
extern static bool AvRevertMmThreadCharacteristics(IntPtr handle);

// usage
int inx = 0;
IntPtr tskHandle = AvSetMmThreadCharacteristics("Pro Audio", ref inx);
// Run this part of the code at the priority level defined by the "Pro Audio" task.
// Using this task, your thread is guaranteed to get the CPU every 10 msecs
// for at least 8 milliseconds, and will run at a realtime priority level of 24,
// irrespective of the activity of the other threads in the system.
// code follows here
....

// revert to kernel scheduling when done with this task
AvRevertMmThreadCharacteristics(tskHandle);
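
Put together, a minimal self-contained sketch might look like this (same
declarations and "Pro Audio" task as above; DoRealtimeWork is only a
placeholder for your loop, and there is no error handling beyond the
null-handle check):

using System;
using System.Runtime.InteropServices;
using System.Security;

class MmcssSample
{
    [DllImport("Avrt.dll"), SuppressUnmanagedCodeSecurity]
    extern static IntPtr AvSetMmThreadCharacteristics(string task, ref int index);

    [DllImport("Avrt.dll"), SuppressUnmanagedCodeSecurity]
    [return: MarshalAs(UnmanagedType.Bool)]
    extern static bool AvRevertMmThreadCharacteristics(IntPtr handle);

    static void Main()
    {
        int inx = 0;
        // Register the calling thread under the "Pro Audio" MMCSS task.
        IntPtr tskHandle = AvSetMmThreadCharacteristics("Pro Audio", ref inx);
        if (tskHandle == IntPtr.Zero)
            throw new InvalidOperationException("MMCSS registration failed (Vista/WS2008 only)");

        try
        {
            DoRealtimeWork();   // placeholder for the actual render loop
        }
        finally
        {
            // Revert to normal scheduling when done with this task.
            AvRevertMmThreadCharacteristics(tskHandle);
        }
    }

    static void DoRealtimeWork() { /* placeholder */ }
}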


Willy.
 

Barry Kelly

Willy said:
You don't believe you are executing 50000000 *
Math.Sin(Math.Cos(Math.Sqrt(i)));
per 16.6 mseconds do you ;-)?

Of course not. I simply upped the iterations until it took up 16.6ms.
Oddly enough, with my Core 2 Quad @ 3GHz, it works out at almost exactly
1 loop iteration per CPU clock tick.
What is effectively done here is incrementing i 50000000 times, the Math
operation is hoisted as the results are never used outside the loop.

I do write compilers for a living :)

-- Barry
 

Willy Denoyette [MVP]

Barry Kelly said:
Of course not. I simply upped the iterations until it took up 16.6ms.
Oddly enough, with my Core 2 Quad @ 3GHz, it works out at almost exactly
1 loop iteration per CPU clock tick.

The loop is "reduced" to three instructions, an increment a compare and a
branch, which is exactly what can be executed in a single clock cycle.
If you move the result of the expression in an array, you will force the
expression to be executed, but reduce the loop count before doing so ;-)



double[] results = new double[2000];
int n = 0;
for (int i = 0; i < 100000; i++)
    y = Math.Sin(Math.Cos(Math.Sqrt(i)));
sw.Stop();
times[c++] = sw.ElapsedTicks
    / (double) Stopwatch.Frequency * 1000;
results[n++] = y;   // storing y forces the expression to actually be evaluated



Willy.
 

Joe Withawk

Yep, this describes the code you have posted before, which doesn't compile;
I would like to see your complete working code.

Here it generally gives values around 1.9, with periodic spikes going up to 3
and random spikes going up to 5.
It is the relevant code, taken from a VS2008 standard Windows Forms
application.

private void Form1_Load(object sender, EventArgs e)
{
    sw = System.Diagnostics.Stopwatch.StartNew();
    System.Threading.Thread t = new System.Threading.Thread(test);
    t.Priority = System.Threading.ThreadPriority.Highest;
    t.Start();
    while (t.IsAlive) ;
    Close();
}

System.Diagnostics.Stopwatch sw;
double[] time = new double[10000];
int c = 0;

void test()
{
    while (true)
    {
        sw.Reset();
        sw.Start();

        double y = 0;
        for (int i = 0; i < 10000; i++)
        {
            y = Math.Sin(Math.Cos(Math.Sqrt(i)));
            if (y == 10) // never is, but should check to prevent optimization
                return;
        }
        sw.Stop();

        time[c++] = (double)(sw.ElapsedTicks) /
            System.Diagnostics.Stopwatch.Frequency * 1000;

        if (c == 5000)
        {
            System.IO.TextWriter writer = System.IO.File.CreateText("c:\\log.txt");
            for (int ii = 0; ii < 5000; ii++)
                writer.WriteLine(time[ii].ToString());
            writer.Close();
            return;
        }
    }
}
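
One aside on the harness itself: the "while (t.IsAlive) ;" busy-wait keeps
the UI thread spinning at full speed for the whole run, which can itself
compete for a core with the timed loop; Thread.Join waits for the same thing
without burning CPU. A sketch of Form1_Load with only that change:

private void Form1_Load(object sender, EventArgs e)
{
    sw = System.Diagnostics.Stopwatch.StartNew();
    System.Threading.Thread t = new System.Threading.Thread(test);
    t.Priority = System.Threading.ThreadPriority.Highest;
    t.Start();
    t.Join();   // wait for the measurement thread without spinning the UI thread
    Close();
}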
 

Joe Withawk

Not sure about what OS you are running, but if it happens to be Vista or
WS2008, you could take advantage from the multimedia scheduler service.
When you register a thread with this service you are guaranteed to get
scheduled at a fixed rate (10 msec by default for all pre-defined tasks).
Following are the API declarations and a usage sample.

I have read that "Multimedia threads are boosted into real-time for 80% of a
task's clock rate (default is 1ms)."
I am unsure about what exactly this means. 80% of the task's clock rate? You
say "Using this task, your thread is guaranteed to get the CPU every
10 msecs for at least 8 milliseconds". That fits the 80%, but how can you
make such a promise? If I have three such threads then they should each get
8 ms of running time every 10 ms, which would not work on one core
(3 × 8 ms = 24 ms > 10 ms). The same problem will exist for multiple cores
with more threads. Am I missing something here?

If I am guaranteed that every 10 ms I get to run for at least 8 ms, then I
will be happy. That should do the trick, and would be real-time enough for
me.

I tried the code, and initially the free-running frame rate increased by some
20%. When I ran the application again it did not go as high. Perhaps it is
time to reset the system and see if something is just badly messed up here.

Thanks for all the input.
 

Willy Denoyette [MVP]

Joe Withawk said:
I have read that "Multimedia threads are boosted into real-time for 80% of
a task's clock rate (default is 1ms)."

Default is 10msec.
I am unsure about what exactly this means. 80% of the task's clock rate? You
say "Using this task, your thread is guaranteed to get the CPU every
10 msecs for at least 8 milliseconds". That fits the 80%, but how can you
make such a promise? If I have three such threads then they should each
get 8 ms of running time every 10 ms. This would not work on one core. The
same problem will exist for multiple cores with more threads. Am I missing
something here?

You can't have 3 such threads if you only have 2 cores; the MM scheduler can
only make this happen if you have at least as many processor cores available
as there are such high-priority threads to serve.
Note also that a task (like "Pro Audio") defines the priority level a
registered thread will receive during its 80% quantum. Once the thread has
consumed this quantum, the MMCSS will drop the thread's priority level to a
per-task configured value (1 for "Pro Audio"). This guarantees that all
lower-priority threads get a chance to get at the CPU.

If I am guaranteed that every 10 ms I get to run for at least 8 ms, then I
will be happy. That should do the trick, and would be realtime enough for
me.

This is what MMCSS is meant for. Note however that threads running at a lower
priority level, but at an IRQL > 0 (kernel-mode threads only), will still be
able to spoil the party :-(.

Willy.
 

Joe Withawk

Note also that a task (like "Pro Audio") defines the priority level a
registered thread will receive during its 80% quantum. Once the thread has
consumed this quantum, the MMCSS will drop the thread's priority level to a
per-task configured value (1 for "Pro Audio"). This guarantees that all
lower-priority threads get a chance to get at the CPU.

What does that mean? The thread will only get this boost once and then needs
to call the manager again to be elevated once more?
This is what MMCSS is meant for. Note however that threads running at a lower
priority level, but at an IRQL > 0 (kernel-mode threads only), will still be
able to spoil the party :-(.

I will just have to be satisfied with getting the best performance possible.
If some driver insists on stealing the CPU then there is nothing I, as an
application developer, can do about it. I assume that I will generally never
see that as a problem. After all... the system is only supposed to run this
one application on an otherwise clean installation.
Thanks a lot for the info on this method. I had certainly never heard about
it before.
 

Willy Denoyette [MVP]

Joe Withawk said:
What does that mean? The thread will only get this boost once and then
need to call the manager again to be elevated once more?

No, the registered thread gets this boost for as long as you don't
de-register the handle obtained from the AvSetMmThreadCharacteristics API
call (by calling AvRevertMmThreadCharacteristics).

I will just have to be satisfied with getting the best performance
possible. If some driver insists on stealing the cpu then there is nothing
I, as application developer, can do about it. I assume that I will
generally never see that as a problem. After all... the system is only
supposed to run this one application on an otherwise clean installation.
Thanks a lot for the info on this method. I had certainly never heard
about it before.

This MM scheduler service is meant to prevent discontinuities in media
streaming (especially audio), by guaranteeing that the registered thread(s)
get the CPU at a fixed interval for a minimum period of time.

Willy.
 

Joe Withawk

No, the registered thread gets this boost for as long as you don't
de-register the handle obtained from the AvSetMmThreadCharacteristics API
call (by calling AvRevertMmThreadCharacteristics).

Thanks, I will keep this in mind when testing on the new system. It should be
nice to have some certainty that I get to render at least that much :)
 
