Skip to content

Conversation

@wangchdo
Copy link
Contributor

@wangchdo wangchdo commented Dec 19, 2025

Summary

@Fix-Point tested the latest merged hrtimer implementation and found an issue where a running hrtimer is not allowed to be restarted.

This patch addresses the issue by optimizing the state machine as follows:

image

Impact

This patch fixes an issue in the standalone hrtimer module implementation, without affecting any other NuttX functions.

Testing

test code:(provided by @Fix-Point )

#include <nuttx/config.h>
#include <stdio.h>

#include <nuttx/hrtimer.h>

#define HRTIMER_TEST_THREAD_NR (CONFIG_SMP_NCPUS * 8)

/****************************************************************************
 * Public Functions
 ****************************************************************************/

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)
{
  printf("test_callback called!\n");
  return 0;
}

static void* test_thread(void *arg)
{
  irqstate_t  flags;
  hrtimer_t   timer;
  spinlock_t  lock = SP_UNLOCKED;
  hrtimer_init(&timer, test_callback, NULL);
  while (1)
    {
      uint64_t delay = rand() % NSEC_PER_MSEC;
      int ret;

      /* Simulate the usage of driver->wait_dog. */

      flags = spin_lock_irqsave(&lock);

      /* The driver lock acquired */

      /* First try, failed. Because hrtimer_start can not ensure the timer being started. */

      ret = hrtimer_cancel(&timer);
      // ret = hrtimer_start(&timer, 10 * NSEC_PER_USEC, HRTIMER_MODE_REL); /* May fail */

      /* This try-loop start should be OK. But it failed again.
       * Besides, we can not sleep or spin in the critical sections.
       */

      while (hrtimer_start(&timer, 10 * NSEC_PER_USEC, HRTIMER_MODE_REL) != OK);
      ret = OK;

      /* Second try, Success, but we can not sleep or spin in the critical section. */

      // ret = hrtimer_cancel_sync(&timer); /* Sleep in critical sections */
      // ret = hrtimer_start(&timer, delay, HRTIMER_MODE_REL);


      spin_unlock_irqrestore(&lock, flags);

      if (ret != OK)
        {
          printf("hrtimer_start failed\n");
        }
      up_ndelay(delay);
    }
  return NULL;
}

/****************************************************************************
 * hello_main
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  unsigned int   thread_id;
  pthread_attr_t attr;
  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];

  printf("hrtimer_test start...\n");

  ASSERT(pthread_attr_init(&attr) == 0);

  /* Create wdog test thread */

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      ASSERT(pthread_create(&pthreads[thread_id], &attr,
                            test_thread, NULL) == 0);
    }

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      pthread_join(pthreads[thread_id], NULL);
    }

  ASSERT(pthread_attr_destroy(&attr) == 0);

  printf("hrtimer_test end...\n");
  return 0;
}

test log on rv-virt:smp64,

NuttShell (NSH)
nsh>
nsh> uname -a
NuttX 0.0.0 15e0a83-dirty Dec 19 2025 16:57:08 risc-v rv-virt
nsh>
nsh> hello

test callback ...
test callback ...
test callback ...
test callback ...
test callback ...
test callback ...
test callback ...
test callback ...
(....)

@github-actions github-actions bot added Area: OS Components OS Components issues Size: S The size of the change in this PR is small labels Dec 19, 2025
@wangchdo wangchdo force-pushed the fix_hrtimer_bug_on_smp branch from 15e0a83 to ae05602 Compare December 19, 2025 09:04
acassis
acassis previously approved these changes Dec 19, 2025
Copy link
Contributor

@acassis acassis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangchdo this diagram that you added to Summary should become part the Documentation about hrtimer

@acassis
Copy link
Contributor

acassis commented Dec 19, 2025

@wangchdo please create a nuttx-apps/testing/hrtime/thread_test with the code you added it

@Fix-Point
Copy link
Contributor

Fix-Point commented Dec 19, 2025

Please pay more respect to parallel programming. It does not look like a simple change can fix this at all.

I hope that before you claim you fix the issue, you can review your state-machine more carefully. I must say sorry that I have not too much time to review your commit. I am not your z3 SMT solver who can always automatically generate the counter-examples for you. I think it is impossible (or very hard) to allow restart the timer correctly unless refactoring your state-machine design. At least, I can not find a solution that can ensure functional correctness for your design. That is why I recommend refactoring your implementation with #17556.

For example, When the newly-set timer (triggered at t2) while the old timer callback function (triggered at t1) is executing simultaneously. This is a highly likely SMP scenario of thread interleaving. The old timer callback returns next as UINT32_MAX, while the new request callback returns next as 1000. However, since the old timer callback detects that the hrtimer is in the running state, so it sets the next timer to (t2 + UINT32_MAX) instead of (t1 + UINT32_MAX). I suspect that the reason your patch fails to run ostest on rv-virt:smp is also due to this issue.

Timeline:
CPU1 (Timer Callback)                     CPU2 (Set New Timer)
------|--------------------------------------|-------------------------
      |                                      |
      t1: Old timer triggers at t1           |
      |--- Callback starts                   |
      |   hrtimer->state <- running          | [Lock]
      | [Unlock]                             t2: New timer being start(t2)
      |                                      |--- hrtimer_start()
      |                                      |    hrtimer->state <- armed
      |   ...                                | [Unlock]
      |                                      | ...
      |   Callback executes...               | [Lock]
      |                                      |--- New timer triggered
      |                                      |    hrtimer->state <- running
      |                                      | [Unlock]
      |                                      |    Calllback executes....
      |                                      |
      |   Returns next = UINT32_MAX          |
      | [Lock]                               |
      | if (hrtimer->state == running)       |
      |   next expired                       | 
      |    = t2 + UINT32_MAX (wrong!)        | 
      |   hrtimer->state <- armed            | 
      | [Unlock]                             |
      |                                      |  [Lock]
      |                                      |--- hrtimer->state != running
      |                                      |    failed to set next (wrong!)
      |                                      |
------|--------------------------------------|-------------------------

By the way, here is the another simple test-case that fails that I do not even know why. It stucked at hrtimer_cancel_sync(&timer);. Please fix this problem first.

I must say that I am not a lab rat or your tester. Please ensure the functional correctness of your patches before merging it to upstream. I fully agree with @anchao, who pointed out my problem in #16342 before.

emm ... community is not a testing ground, please do not treating everyone like lab rats.

After that, I am trying my best to ensure my PR is functional correctness. I hope you too.

/****************************************************************************
 * apps/examples/hello/hello_main.c
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.  The
 * ASF licenses this file to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the
 * License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the
 * License for the specific language governing permissions and limitations
 * under the License.
 *
 ****************************************************************************/

/****************************************************************************
 * Included Files
 ****************************************************************************/

#include <nuttx/config.h>
#include <stdio.h>
#include <unistd.h>

#include <nuttx/hrtimer.h>

#define HRTIMER_TEST_THREAD_NR (1)
#define HRTIMER_TEST_NR        (1000000)

/****************************************************************************
 * Public Functions
 ****************************************************************************/

static int volatile tcount = 0;
static volatile uint32_t next = 0;
static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)
{
  tcount++;
  up_ndelay(hrtimer->expired % (10 * NSEC_PER_USEC));
  return 0;
}

static uint32_t test_callback_background(FAR struct hrtimer_s *hrtimer)
{
  up_ndelay(hrtimer->expired % NSEC_PER_USEC);
  return 0;
}


static void test1(int tid)
{
  hrtimer_t   timer;
  int         count = 0;
  irqstate_t flags;
  spinlock_t lock;

  if (tid == 0)
    {
      hrtimer_init(&timer, test_callback, NULL);
    }
  else
    {
      hrtimer_init(&timer, test_callback_background, NULL);
    }

  while (count++ < HRTIMER_TEST_NR)
    {
      int ret;
      if (tid == 0)
        {
          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);

          /* Simulate the periodical hrtimer.. */

          flags = spin_lock_irqsave(&lock);

          /* Use as periodical timer */

          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);

          spin_unlock_irqrestore(&lock, flags);

          up_ndelay(NSEC_PER_MSEC);

          flags = spin_lock_irqsave(&lock);

          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          up_ndelay(NSEC_PER_MSEC);

          hrtimer_cancel_sync(&timer); // stucked here????
          printf("???\n");
        }
      else
        {
          /* Simulate the background hrtimer.. */

          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);

          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, delay, HRTIMER_MODE_REL);
        }

      UNUSED(ret);
    }
}

static void* test_thread(void *arg)
{
  while (1)
    {
      test1((int)arg);
    }
  return NULL;
}
/****************************************************************************
 * hello_main
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  unsigned int   thread_id;
  pthread_attr_t attr;
  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];

  printf("hrtimer_test start...\n");

  ASSERT(pthread_attr_init(&attr) == 0);

  /* Create wdog test thread */

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      ASSERT(pthread_create(&pthreads[thread_id], &attr,
                            test_thread, (void *)thread_id) == 0);
    }

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      pthread_join(pthreads[thread_id], NULL);
    }

  ASSERT(pthread_attr_destroy(&attr) == 0);

  printf("hrtimer_test end...\n");
  return 0;
}

@wangchdo
Copy link
Contributor Author

wangchdo commented Dec 19, 2025

Please pay more respect to parallel programming. It does not look like a simple change can fix this at all.

why this change is not simple? why fixing issue is not respecting you?

I hope that before you claim you fix the issue, you can review your state-machine more carefully. I must say sorry that I have not too much time to review your commit. I am not your z3 SMT solver who can always automatically generate the counter-examples for you. I think it is impossible (or very hard) to allow restart the timer correctly unless refactoring your state-machine design. At least, I can not find a solution that can ensure functional correctness for your design. That is why I recommend refactoring your implementation with #17556.

For example, When the newly-set timer (triggered at t2) while the old timer callback function (triggered at t1) is executing simultaneously. This is a highly likely SMP scenario of thread interleaving. The old timer callback returns next as UINT32_MAX, while the new request callback returns next as 1000. However, since the old timer callback detects that the hrtimer is in the running state, so it sets the next timer to (t2 + UINT32_MAX) instead of (t1 + UINT32_MAX). I suspect that the reason your patch fails to run ostest on rv-virt:smp is also due to this issue.

I already showed that ostest passed on rv-virt:smp64

Timeline:

CPU1 (Timer Callback)                     CPU2 (Set New Timer)

------|--------------------------------------|-------------------------

      |                                      |

      t1: Old timer triggers at t1           |

      |--- Callback starts                   |

      |   hrtimer->state <- running          | [Lock]

      | [Unlock]                             t2: New timer being start(t2)

      |                                      |--- hrtimer_start()

      |                                      |    hrtimer->state <- armed

      |   ...                                | [Unlock]

      |                                      | ...

      |   Callback executes...               | [Lock]

      |                                      |--- New timer triggered

      |                                      |    hrtimer->state <- running

      |                                      | [Unlock]

      |                                      |    Calllback executes....

      |                                      |

      |   Returns next = UINT32_MAX          |

      | [Lock]                               |

      | if (hrtimer->state == running)       |

      |   next expired                       | 

      |    = t2 + UINT32_MAX (wrong!)        | 

      |   hrtimer->state <- armed            | 

      | [Unlock]                             |

      |                                      |  [Lock]

      |                                      |--- hrtimer->state != running

      |                                      |    failed to set next (wrong!)

      |                                      |

------|--------------------------------------|-------------------------

I don’t think the case you pointed out is an issue, as allowing an hrtimer that has just been re-armed to be restarted again is purely a design choice. My initial design allowed re-arming an already armed hrtimer. After considering @xiaoxiang781216’s comments, I have revised the implementation to treat this case as an error.

By the way, here is the another simple test-case that fails that I do not even know why. It stucked at hrtimer_cancel_sync(&timer);. Please fix this problem first.

I must say that I am not a lab rat or your tester. Please ensure the functional correctness of your patches before merging it to upstream. I fully agree with @anchao, who pointed out my problem in #16342 before.

emm ... community is not a testing ground, please do not treating everyone like lab rats.

After that, I am trying my best to ensure my PR is functional correctness. I hope you too.

/****************************************************************************

 * apps/examples/hello/hello_main.c

 *

 * SPDX-License-Identifier: Apache-2.0

 *

 * Licensed to the Apache Software Foundation (ASF) under one or more

 * contributor license agreements.  See the NOTICE file distributed with

 * this work for additional information regarding copyright ownership.  The

 * ASF licenses this file to you under the Apache License, Version 2.0 (the

 * "License"); you may not use this file except in compliance with the

 * License.  You may obtain a copy of the License at

 *

 *   http://www.apache.org/licenses/LICENSE-2.0

 *

 * Unless required by applicable law or agreed to in writing, software

 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT

 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the

 * License for the specific language governing permissions and limitations

 * under the License.

 *

 ****************************************************************************/



/****************************************************************************

 * Included Files

 ****************************************************************************/



#include <nuttx/config.h>

#include <stdio.h>

#include <unistd.h>



#include <nuttx/hrtimer.h>



#define HRTIMER_TEST_THREAD_NR (1)

#define HRTIMER_TEST_NR        (1000000)



/****************************************************************************

 * Public Functions

 ****************************************************************************/



static int volatile tcount = 0;

static volatile uint32_t next = 0;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)

{

  tcount++;

  up_ndelay(hrtimer->expired % (10 * NSEC_PER_USEC));

  return 0;

}



static uint32_t test_callback_background(FAR struct hrtimer_s *hrtimer)

{

  up_ndelay(hrtimer->expired % NSEC_PER_USEC);

  return 0;

}





static void test1(int tid)

{

  hrtimer_t   timer;

  int         count = 0;

  irqstate_t flags;

  spinlock_t lock;



  if (tid == 0)

    {

      hrtimer_init(&timer, test_callback, NULL);

    }

  else

    {

      hrtimer_init(&timer, test_callback_background, NULL);

    }



  while (count++ < HRTIMER_TEST_NR)

    {

      int ret;

      if (tid == 0)

        {

          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);



          /* Simulate the periodical hrtimer.. */



          flags = spin_lock_irqsave(&lock);



          /* Use as periodical timer */



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);



          spin_unlock_irqrestore(&lock, flags);



          up_ndelay(NSEC_PER_MSEC);



          flags = spin_lock_irqsave(&lock);



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);

          spin_unlock_irqrestore(&lock, flags);



          up_ndelay(NSEC_PER_MSEC);



          hrtimer_cancel_sync(&timer); // stucked here????

          printf("???\n");

        }

      else

        {

          /* Simulate the background hrtimer.. */



          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, delay, HRTIMER_MODE_REL);

        }



      UNUSED(ret);

    }

}



static void* test_thread(void *arg)

{

  while (1)

    {

      test1((int)arg);

    }

  return NULL;

}

/****************************************************************************

 * hello_main

 ****************************************************************************/



int main(int argc, FAR char *argv[])

{

  unsigned int   thread_id;

  pthread_attr_t attr;

  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];



  printf("hrtimer_test start...\n");



  ASSERT(pthread_attr_init(&attr) == 0);



  /* Create wdog test thread */



  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

    {

      ASSERT(pthread_create(&pthreads[thread_id], &attr,

                            test_thread, (void *)thread_id) == 0);

    }



  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

    {

      pthread_join(pthreads[thread_id], NULL);

    }



  ASSERT(pthread_attr_destroy(&attr) == 0);



  printf("hrtimer_test end...\n");

  return 0;

}

this is a minor issue, i just forgot to set hritmer to inactive when period is 0 according to the state machine I designe, it is fixed now:


case HRTIMER_STATE_RUNNING:
            {
              /* Timer callback completed normally */

              if (period > 0)
                {
                  hrtimer->expired += period;
                  hrtimer->state = HRTIMER_STATE_ARMED;
                  RB_INSERT(hrtimer_tree_s, &g_hrtimer_tree,
                            &hrtimer->node);
                }
              else
                {
                  hrtimer->state = HRTIMER_STATE_INACTIVE;
                }

              break;
            }

and your new test passed on rv-virt:smp64 when this update is added:

NuttShell (NSH)
nsh> 
nsh> 
nsh> uname -a
NuttX 0.0.0 02057dc356-dirty Dec 20 2025 00:21:54 risc-v rv-virt
nsh> 
nsh> hello

???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???

(...)

By the way, I have implemented hrtimer as an independent module, so even if it is not perfect, it does not impact existing NuttX functionality.
I also want to emphasize that I respect you very much and have no intention of offending you. My concern is simply that spending a significant amount of time re-inventing the wheel—and claiming that a new implementation is superior in every aspect—may not be the most productive approach. In my opinion, building improvements on top of an existing implementation could be a more efficient and constructive direction.

Copy link
Contributor

@acassis acassis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the hrtimer Documentation and add this diagram

@wangchdo wangchdo requested a review from acassis December 20, 2025 03:57
@github-actions github-actions bot added the Area: Documentation Improvements or additions to documentation label Dec 20, 2025
@wangchdo
Copy link
Contributor Author

Please update the hrtimer Documentation and add this diagram

Hi @acassis,

I’ve added the state-machine diagram to the hrtimer documentation. Could you please take another look?

Thanks!

@Fix-Point
Copy link
Contributor

Fix-Point commented Dec 20, 2025

Please pay more respect to parallel programming. It does not look like a simple change can fix this at all.

why this change is not simple? why fixing issue is not respecting you?

I hope that before you claim you fix the issue, you can review your state-machine more carefully. I must say sorry that I have not too much time to review your commit. I am not your z3 SMT solver who can always automatically generate the counter-examples for you. I think it is impossible (or very hard) to allow restart the timer correctly unless refactoring your state-machine design. At least, I can not find a solution that can ensure functional correctness for your design. That is why I recommend refactoring your implementation with #17556.
For example, When the newly-set timer (triggered at t2) while the old timer callback function (triggered at t1) is executing simultaneously. This is a highly likely SMP scenario of thread interleaving. The old timer callback returns next as UINT32_MAX, while the new request callback returns next as 1000. However, since the old timer callback detects that the hrtimer is in the running state, so it sets the next timer to (t2 + UINT32_MAX) instead of (t1 + UINT32_MAX). I suspect that the reason your patch fails to run ostest on rv-virt:smp is also due to this issue.

I already showed that ostest passed on rv-virt:smp64

Timeline:

CPU1 (Timer Callback)                     CPU2 (Set New Timer)

------|--------------------------------------|-------------------------

      |                                      |

      t1: Old timer triggers at t1           |

      |--- Callback starts                   |

      |   hrtimer->state <- running          | [Lock]

      | [Unlock]                             t2: New timer being start(t2)

      |                                      |--- hrtimer_start()

      |                                      |    hrtimer->state <- armed

      |   ...                                | [Unlock]

      |                                      | ...

      |   Callback executes...               | [Lock]

      |                                      |--- New timer triggered

      |                                      |    hrtimer->state <- running

      |                                      | [Unlock]

      |                                      |    Calllback executes....

      |                                      |

      |   Returns next = UINT32_MAX          |

      | [Lock]                               |

      | if (hrtimer->state == running)       |

      |   next expired                       | 

      |    = t2 + UINT32_MAX (wrong!)        | 

      |   hrtimer->state <- armed            | 

      | [Unlock]                             |

      |                                      |  [Lock]

      |                                      |--- hrtimer->state != running

      |                                      |    failed to set next (wrong!)

      |                                      |

------|--------------------------------------|-------------------------

I don’t think the case you pointed out is an issue, as allowing an hrtimer that has just been re-armed to be restarted again is purely a design choice. My initial design allowed re-arming an already armed hrtimer. After considering @xiaoxiang781216’s comments, I have revised the implementation to treat this case as an error.

By the way, here is the another simple test-case that fails that I do not even know why. It stucked at hrtimer_cancel_sync(&timer);. Please fix this problem first.
I must say that I am not a lab rat or your tester. Please ensure the functional correctness of your patches before merging it to upstream. I fully agree with @anchao, who pointed out my problem in #16342 before.

emm ... community is not a testing ground, please do not treating everyone like lab rats.

After that, I am trying my best to ensure my PR is functional correctness. I hope you too.

/****************************************************************************

 * apps/examples/hello/hello_main.c

 *

 * SPDX-License-Identifier: Apache-2.0

 *

 * Licensed to the Apache Software Foundation (ASF) under one or more

 * contributor license agreements.  See the NOTICE file distributed with

 * this work for additional information regarding copyright ownership.  The

 * ASF licenses this file to you under the Apache License, Version 2.0 (the

 * "License"); you may not use this file except in compliance with the

 * License.  You may obtain a copy of the License at

 *

 *   http://www.apache.org/licenses/LICENSE-2.0

 *

 * Unless required by applicable law or agreed to in writing, software

 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT

 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the

 * License for the specific language governing permissions and limitations

 * under the License.

 *

 ****************************************************************************/



/****************************************************************************

 * Included Files

 ****************************************************************************/



#include <nuttx/config.h>

#include <stdio.h>

#include <unistd.h>



#include <nuttx/hrtimer.h>



#define HRTIMER_TEST_THREAD_NR (1)

#define HRTIMER_TEST_NR        (1000000)



/****************************************************************************

 * Public Functions

 ****************************************************************************/



static int volatile tcount = 0;

static volatile uint32_t next = 0;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)

{

  tcount++;

  up_ndelay(hrtimer->expired % (10 * NSEC_PER_USEC));

  return 0;

}



static uint32_t test_callback_background(FAR struct hrtimer_s *hrtimer)

{

  up_ndelay(hrtimer->expired % NSEC_PER_USEC);

  return 0;

}





static void test1(int tid)

{

  hrtimer_t   timer;

  int         count = 0;

  irqstate_t flags;

  spinlock_t lock;



  if (tid == 0)

    {

      hrtimer_init(&timer, test_callback, NULL);

    }

  else

    {

      hrtimer_init(&timer, test_callback_background, NULL);

    }



  while (count++ < HRTIMER_TEST_NR)

    {

      int ret;

      if (tid == 0)

        {

          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);



          /* Simulate the periodical hrtimer.. */



          flags = spin_lock_irqsave(&lock);



          /* Use as periodical timer */



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);



          spin_unlock_irqrestore(&lock, flags);



          up_ndelay(NSEC_PER_MSEC);



          flags = spin_lock_irqsave(&lock);



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);

          spin_unlock_irqrestore(&lock, flags);



          up_ndelay(NSEC_PER_MSEC);



          hrtimer_cancel_sync(&timer); // stucked here????

          printf("???\n");

        }

      else

        {

          /* Simulate the background hrtimer.. */



          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, delay, HRTIMER_MODE_REL);

        }



      UNUSED(ret);

    }

}



static void* test_thread(void *arg)

{

  while (1)

    {

      test1((int)arg);

    }

  return NULL;

}

/****************************************************************************

 * hello_main

 ****************************************************************************/



int main(int argc, FAR char *argv[])

{

  unsigned int   thread_id;

  pthread_attr_t attr;

  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];



  printf("hrtimer_test start...\n");



  ASSERT(pthread_attr_init(&attr) == 0);



  /* Create wdog test thread */



  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

    {

      ASSERT(pthread_create(&pthreads[thread_id], &attr,

                            test_thread, (void *)thread_id) == 0);

    }



  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

    {

      pthread_join(pthreads[thread_id], NULL);

    }



  ASSERT(pthread_attr_destroy(&attr) == 0);



  printf("hrtimer_test end...\n");

  return 0;

}

this is a minor issue, i just forgot to set hritmer to inactive when period is 0 according to the state machine I designe, it is fixed now:


case HRTIMER_STATE_RUNNING:
            {
              /* Timer callback completed normally */

              if (period > 0)
                {
                  hrtimer->expired += period;
                  hrtimer->state = HRTIMER_STATE_ARMED;
                  RB_INSERT(hrtimer_tree_s, &g_hrtimer_tree,
                            &hrtimer->node);
                }
              else
                {
                  hrtimer->state = HRTIMER_STATE_INACTIVE;
                }

              break;
            }

and your new test passed on rv-virt:smp64 when this update is added:

NuttShell (NSH)
nsh> 
nsh> 
nsh> uname -a
NuttX 0.0.0 02057dc356-dirty Dec 20 2025 00:21:54 risc-v rv-virt
nsh> 
nsh> hello

???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???

(...)

By the way, I have implemented hrtimer as an independent module, so even if it is not perfect, it does not impact existing NuttX functionality. I also want to emphasize that I respect you very much and have no intention of offending you. My concern is simply that spending a significant amount of time re-inventing the wheel—and claiming that a new implementation is superior in every aspect—may not be the most productive approach. In my opinion, building improvements on top of an existing implementation could be a more efficient and constructive direction.

I sincerely apologize for providing incorrect test cases earlier. Due to time constraints, I did not have the opportunity to review them promptly, which unfortunately wasted your time. To address this, I have now provided a reproducible test case that demonstrates the kind of thread interleaving I mentioned.

#include <nuttx/config.h>
#include <stdio.h>

#include <nuttx/hrtimer.h>
#include <sys/resource.h>

/****************************************************************************
 * Name: hrtimer_process
 *
 * Description:
 *   Called from the timer interrupt handler to process expired
 *   high-resolution timers. If a timer has expired, its callback
 *   function will be executed in the context of the timer interrupt.
 *
 * Input Parameters:
 *   now - The current time (nsecs).
 *
 * Returned Value:
 *   None
 ****************************************************************************/

void hrtimer_process(uint64_t now);

/****************************************************************************
 * Inline Functions
 ****************************************************************************/

/****************************************************************************
 * Name: hrtimer_gettime
 *
 * Description:
 *   Get the current high-resolution time in nanoseconds.
 *
 * Returned Value:
 *   Current time in nanoseconds.
 ****************************************************************************/

static inline_function
uint64_t hrtimer_gettime(void)
{
  struct timespec ts;

  /* Get current time from platform-specific timer */

  clock_systime_timespec(&ts);

  /* Convert timespec to nanoseconds */

  return clock_time2nsec(&ts);
}

#define HRTIMER_TEST_THREAD_NR (4)

/****************************************************************************
 * Public Functions
 ****************************************************************************/

static volatile int count = 0;
static hrtimer_t    timer;
static spinlock_t   lock = SP_UNLOCKED;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)
{
  irqstate_t flags;
  up_ndelay(10 * NSEC_PER_USEC);
  flags = spin_lock_irqsave(&lock);
  count++;
  spin_unlock_irqrestore(&lock, flags);
  return 0;
}

static uint32_t test_callback_back(FAR struct hrtimer_s *hrtimer)
{
  return 0;
}

static void* test_thread(void *arg)
{
  irqstate_t  flags;
  int         tid   = (int)arg;
  int         ret;

  if (tid == 0)
    {
      hrtimer_init(&timer, test_callback, NULL);
    }

  while (1)
    {
      if (tid == 0)
        {
          int tmp_count;
          flags = spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          up_ndelay(1000);

          spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          ret = hrtimer_cancel_sync(&timer); // timer should be fully-cancelled here.
          tmp_count = count; // Read the count
          printf("%d\n", tmp_count);
          ASSERT(tmp_count == count); // This should not assert since timer has been cancelled.
        }
      else
        {
          flags = up_irq_save();
          hrtimer_process(hrtimer_gettime());
          up_irq_restore(flags);
        }
    }

  UNUSED(ret);

  return NULL;
}

/****************************************************************************
 * hello_main
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  unsigned int   thread_id;
  pthread_attr_t attr;
  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];

  printf("hrtimer_test start...\n");

  ASSERT(pthread_attr_init(&attr) == 0);

  /* Create wdog test thread */

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      ASSERT(pthread_create(&pthreads[thread_id], &attr,
                            test_thread, (void *)thread_id) == 0);
    }

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      pthread_join(pthreads[thread_id], NULL);
    }

  ASSERT(pthread_attr_destroy(&attr) == 0);

  printf("hrtimer_test end...\n");
  return 0;
}

In this scenario, a timer that should have been canceled continues to execute. This could lead to very subtle concurrency bugs (such as use-after-free behavior). If such issues make their way into the production environment, they could result in hundreds or even thousands of man-hours being spent on debugging, causing significant losses. Therefore, I find your claim that this is not a problem to be unconvincing.

I support @anchao’s viewpoint that we should design HRTimer from the user's perspective. In #17556, such issues are guaranteed not to occur.
As I have mentioned before, there are fundamental flaws in the underlying state machine design of your hrtimer. Minor modifications cannot fix these issues. Therefore, I believe the most efficient approach is to replace your implementation with #17556. Honestly, your attempts, in my view, are good but futile. Because the root of the problem is the violation of ownership invariants. If you insist on finding a correct solution through trial and error, I can only respect your decision and wish you the best.

Regarding the accusation of “reinventing the wheel,” I must counter your argument—

  • Strive for excellence. Today, there are numerous embedded RTOS options available. If NuttX aims to attract more users, it must excel in every aspect and prove itself superior to other RTOSes. Even a 1% optimization in CPU time overhead or a few bytes of memory savings can benefit tens of millions of IoT devices running NuttX, saving substantial energy and hardware costs. In my opinion, every byte and every CPU cycle should be optimized as much as possible. I hope you share this mindset.

  • As educated engineers, we should strive to deliver our best work. However, pursuing excellence demands significant time and effort—if you are familiar with the Pareto principle, you’ll understand that achieving 80% often requires only 1% of the effort, but perfecting the remaining 20% demands 99% of the effort. Please do not assume that those who pursue this level of refinement are wasting their time. This is extremely disrespectful.

Furthermore, as @anchao emphasized, the community is not a testing ground. We should at least ensure basic functional correctness before merging code into the main branch. Please refrain from hastily integrating flawed code like this in the future.

Finally, over the coming weeks, I will focus on merging #17556. I do not have the time to repeatedly point out the fundamental issues in your underlying state machine design. I hope you understand. If you have the time, I would appreciate it if you could review my design for any functional correctness issues or provide suggestions for improvement. Thank you.

qemu-system-riscv32 -semihosting -M virt,aclint=on -cpu rv32 -smp 8 -bios none -nographic -kernel nuttx -s
ABC
NuttShell (NSH) NuttX-12.11.0
nsh>
nsh> hello
hrtimer_test start...
1
2
3
4
5
5
[CPU3] Current Version: NuttX  12.11.0 290d47b4da-dirty Dec 20 2025 12:23:59 risc-v
[CPU3] Assertion failed : at file: :0 task(CPU3): hello process: hello 0x80017690
[CPU3] EPC: 800021cc
[CPU3] A0: 8002b774 A1: 00000000 A2: 00000003 A3: 80033090
[CPU3] A4: 00000002 A5: 00000001 A6: 00000000 A7: 00000001
[CPU3] T0: 0000002e T1: 0000006a T2: 000001ff T3: 0000004c
[CPU3] T4: 00000068 T5: 00000009 T6: 0000002a
[CPU3] S0: 0000018c S1: 00000000 S2: 8002b2d0 S3: 80033090
[CPU3] S4: 00000000 S5: 00000000 S6: 8002d000 S7: 00002088
[CPU3] S8: 00000006 S9: 00000000 S10: 00000000 S11: 00000000
[CPU3] SP: 80033800 FP: 0000018c TP: 00000000 RA: 800021cc
[CPU3] User Stack:
[CPU3]   base: 0x80033178
[CPU3]   size: 00002024
[CPU3]     sp: 0x80033800
[CPU3] 0x800337e0: 8002d000 800227d4 800227d4 80033090 8002b2d0 8002b774 800324e8 8000228c
[CPU3] 0x80033800: 80017690 00000000 00000000 00000002 0000000a 800338ac 00000000 80008346
[CPU3] 0x80033820: 00000000 8001e424 80033090 8002b774 00000000 00000000 00000000 7474754e
[CPU3] 0x80033840: 00000058 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[CPU3] 0x80033860: 00000000 00000000 00000000 00000000 2e323100 302e3131 80025000 8002d000
[CPU3] 0x80033880: 8002d000 393257d4 37346430 61643462 7269642d 44207974 32206365 30322030
[CPU3] 0x800338a0: 31203532 33323a32 0039353a 00000002 8001cec0 8001ce6a 736972de 00762d63
[CPU3] 0x800338c0: 00002088 00000005 800257d4 8001c372 00000000 00000000 00000000 00000000
[CPU3] 0x800338e0: 00000000 8002d404 80025000 8002d000 8002d000 00000005 8002d000 80007bcc
[CPU3] 0x80033900: 00000001 00000000 ee6b2800 80017798 00000000 00000000 00000000 00000000
[CPU3] 0x80033920: 00000000 00000000 00000000 00000000 00000000 00000000 80033090 8001bf0c
[CPU3] 0x80033940: 00000000 00000000 00000000 8001d36a 00000000 00000000 00000000 00000000
[CPU3] 0x80033960: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
nsh>

@wangchdo
Copy link
Contributor Author

@Fix-Point You need to test this hrtimer module according to its design, not according to your own assumption.

@Fix-Point
Copy link
Contributor

Fix-Point commented Dec 20, 2025

@Fix-Point You need to test this hrtimer module according to its design, not according to your own assumption.

This is not my own assumption. Such a bug actually occurred in our systems. Please do not underestimate the complexity of the SMP systems.

I shoud remind you that the test case I provided only maximized the probability of BUG occurrence. In production, there is a very small probability of triggering the BUG, and it is difficult to find the root-cause and fix, which can cause countless losses.

@wangchdo
Copy link
Contributor Author

@Fix-Point You need to test this hrtimer module according to its design, not according to your own assumption.

This is not my own assumption. Such a bug actually occurred in our system. Please do not underestimate the complexity of the SMP system.

I suggest you stop teaching others as a teacher, it does not help anything - this time SMP, last time I remember DO-178, ISO 26262. And also saying words like mine is completely unusable, your's are better in every aspects is also not helpful.

@Fix-Point
Copy link
Contributor

@Fix-Point You need to test this hrtimer module according to its design, not according to your own assumption.

This is not my own assumption. Such a bug actually occurred in our system. Please do not underestimate the complexity of the SMP system.

I suggest you stop teaching others as a teacher, it does not help anything - this time SMP, last time I remember DO-178, ISO 26262. And also saying words like mine is completely unusable, your's are better in every aspects is also not helpful.

I'm truly sorry, that was entirely my fault. I became a bit emotional after being accused of copying someone else's idea without basis. Actually, I never intended to argue with anyone—I only wanted to discuss technical matters and made the NuttX better. Thank you for pointing out the issue; I'll try my best to avoid using such offensive language in the future.

@wangchdo
Copy link
Contributor Author

wangchdo commented Dec 20, 2025

@Fix-Point You need to test this hrtimer module according to its design, not according to your own assumption.

This is not my own assumption. Such a bug actually occurred in our system. Please do not underestimate the complexity of the SMP system.

I suggest you stop teaching others as a teacher, it does not help anything - this time SMP, last time I remember DO-178, ISO 26262. And also saying words like mine is completely unusable, your's are better in every aspects is also not helpful.

I'm truly sorry, that was entirely my fault. I became a bit emotional after being accused of copying someone else's idea without basis. Actually, I never intended to argue with anyone—I only wanted to discuss technical matters and made the NuttX better. Thank you for pointing out the issue; I'll try my best to avoid using such offensive language in the future.

Also, you can check my hrtimer part2 #17573, I think the main design here on supporting os tick with hrtimer are totally diffrent than your previous PR(#17556) now:

Hrtimer is strictly isolated from the current scheduling logic:

  1. All hrtimer code is enabled only when CONFIG_HRTIMER is set.
  2. If CONFIG_HRTIMER is disabled, the scheduler continues to use the existing tick-based or tickless mechanisms with no changes.
  3. When enabled, hrtimer reuses the existing scheduler logic to drive OS ticks via a dedicated hrtimer instance.

@wangchdo wangchdo force-pushed the fix_hrtimer_bug_on_smp branch 2 times, most recently from 612a158 to 81e074a Compare December 20, 2025 06:53
@wangchdo wangchdo changed the title sched/hrtimr: Allow running hrtimer to be restarted sched/hrtimr: Update hrtimer state-machine to allow RUNNING/ARMED hrtimer to be restarted Dec 20, 2025
@wangchdo
Copy link
Contributor Author

wangchdo commented Dec 20, 2025

Please pay more respect to parallel programming. It does not look like a simple change can fix this at all.

why this change is not simple? why fixing issue is not respecting you?

I hope that before you claim you fix the issue, you can review your state-machine more carefully. I must say sorry that I have not too much time to review your commit. I am not your z3 SMT solver who can always automatically generate the counter-examples for you. I think it is impossible (or very hard) to allow restart the timer correctly unless refactoring your state-machine design. At least, I can not find a solution that can ensure functional correctness for your design. That is why I recommend refactoring your implementation with #17556.
For example, When the newly-set timer (triggered at t2) while the old timer callback function (triggered at t1) is executing simultaneously. This is a highly likely SMP scenario of thread interleaving. The old timer callback returns next as UINT32_MAX, while the new request callback returns next as 1000. However, since the old timer callback detects that the hrtimer is in the running state, so it sets the next timer to (t2 + UINT32_MAX) instead of (t1 + UINT32_MAX). I suspect that the reason your patch fails to run ostest on rv-virt:smp is also due to this issue.

I already showed that ostest passed on rv-virt:smp64

Timeline:

CPU1 (Timer Callback)                     CPU2 (Set New Timer)

------|--------------------------------------|-------------------------

      |                                      |

      t1: Old timer triggers at t1           |

      |--- Callback starts                   |

      |   hrtimer->state <- running          | [Lock]

      | [Unlock]                             t2: New timer being start(t2)

      |                                      |--- hrtimer_start()

      |                                      |    hrtimer->state <- armed

      |   ...                                | [Unlock]

      |                                      | ...

      |   Callback executes...               | [Lock]

      |                                      |--- New timer triggered

      |                                      |    hrtimer->state <- running

      |                                      | [Unlock]

      |                                      |    Calllback executes....

      |                                      |

      |   Returns next = UINT32_MAX          |

      | [Lock]                               |

      | if (hrtimer->state == running)       |

      |   next expired                       | 

      |    = t2 + UINT32_MAX (wrong!)        | 

      |   hrtimer->state <- armed            | 

      | [Unlock]                             |

      |                                      |  [Lock]

      |                                      |--- hrtimer->state != running

      |                                      |    failed to set next (wrong!)

      |                                      |

------|--------------------------------------|-------------------------

I don’t think the case you pointed out is an issue, as allowing an hrtimer that has just been re-armed to be restarted again is purely a design choice. My initial design allowed re-arming an already armed hrtimer. After considering @xiaoxiang781216’s comments, I have revised the implementation to treat this case as an error.

By the way, here is the another simple test-case that fails that I do not even know why. It stucked at hrtimer_cancel_sync(&timer);. Please fix this problem first.
I must say that I am not a lab rat or your tester. Please ensure the functional correctness of your patches before merging it to upstream. I fully agree with @anchao, who pointed out my problem in #16342 before.

emm ... community is not a testing ground, please do not treating everyone like lab rats.

After that, I am trying my best to ensure my PR is functional correctness. I hope you too.

/****************************************************************************

 * apps/examples/hello/hello_main.c

 *

 * SPDX-License-Identifier: Apache-2.0

 *

 * Licensed to the Apache Software Foundation (ASF) under one or more

 * contributor license agreements.  See the NOTICE file distributed with

 * this work for additional information regarding copyright ownership.  The

 * ASF licenses this file to you under the Apache License, Version 2.0 (the

 * "License"); you may not use this file except in compliance with the

 * License.  You may obtain a copy of the License at

 *

 *   http://www.apache.org/licenses/LICENSE-2.0

 *

 * Unless required by applicable law or agreed to in writing, software

 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT

 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the

 * License for the specific language governing permissions and limitations

 * under the License.

 *

 ****************************************************************************/



/****************************************************************************

 * Included Files

 ****************************************************************************/



#include <nuttx/config.h>

#include <stdio.h>

#include <unistd.h>



#include <nuttx/hrtimer.h>



#define HRTIMER_TEST_THREAD_NR (1)

#define HRTIMER_TEST_NR        (1000000)



/****************************************************************************

 * Public Functions

 ****************************************************************************/



static int volatile tcount = 0;

static volatile uint32_t next = 0;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)

{

  tcount++;

  up_ndelay(hrtimer->expired % (10 * NSEC_PER_USEC));

  return 0;

}



static uint32_t test_callback_background(FAR struct hrtimer_s *hrtimer)

{

  up_ndelay(hrtimer->expired % NSEC_PER_USEC);

  return 0;

}





static void test1(int tid)

{

  hrtimer_t   timer;

  int         count = 0;

  irqstate_t flags;

  spinlock_t lock;



  if (tid == 0)

    {

      hrtimer_init(&timer, test_callback, NULL);

    }

  else

    {

      hrtimer_init(&timer, test_callback_background, NULL);

    }



  while (count++ < HRTIMER_TEST_NR)

    {

      int ret;

      if (tid == 0)

        {

          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);



          /* Simulate the periodical hrtimer.. */



          flags = spin_lock_irqsave(&lock);



          /* Use as periodical timer */



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);



          spin_unlock_irqrestore(&lock, flags);



          up_ndelay(NSEC_PER_MSEC);



          flags = spin_lock_irqsave(&lock);



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);

          spin_unlock_irqrestore(&lock, flags);



          up_ndelay(NSEC_PER_MSEC);



          hrtimer_cancel_sync(&timer); // stucked here????

          printf("???\n");

        }

      else

        {

          /* Simulate the background hrtimer.. */



          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, delay, HRTIMER_MODE_REL);

        }



      UNUSED(ret);

    }

}



static void* test_thread(void *arg)

{

  while (1)

    {

      test1((int)arg);

    }

  return NULL;

}

/****************************************************************************

 * hello_main

 ****************************************************************************/



int main(int argc, FAR char *argv[])

{

  unsigned int   thread_id;

  pthread_attr_t attr;

  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];



  printf("hrtimer_test start...\n");



  ASSERT(pthread_attr_init(&attr) == 0);



  /* Create wdog test thread */



  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

    {

      ASSERT(pthread_create(&pthreads[thread_id], &attr,

                            test_thread, (void *)thread_id) == 0);

    }



  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

    {

      pthread_join(pthreads[thread_id], NULL);

    }



  ASSERT(pthread_attr_destroy(&attr) == 0);



  printf("hrtimer_test end...\n");

  return 0;

}

this is a minor issue, i just forgot to set hritmer to inactive when period is 0 according to the state machine I designe, it is fixed now:


case HRTIMER_STATE_RUNNING:
            {
              /* Timer callback completed normally */

              if (period > 0)
                {
                  hrtimer->expired += period;
                  hrtimer->state = HRTIMER_STATE_ARMED;
                  RB_INSERT(hrtimer_tree_s, &g_hrtimer_tree,
                            &hrtimer->node);
                }
              else
                {
                  hrtimer->state = HRTIMER_STATE_INACTIVE;
                }

              break;
            }

and your new test passed on rv-virt:smp64 when this update is added:

NuttShell (NSH)
nsh> 
nsh> 
nsh> uname -a
NuttX 0.0.0 02057dc356-dirty Dec 20 2025 00:21:54 risc-v rv-virt
nsh> 
nsh> hello

???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???

(...)

By the way, I have implemented hrtimer as an independent module, so even if it is not perfect, it does not impact existing NuttX functionality. I also want to emphasize that I respect you very much and have no intention of offending you. My concern is simply that spending a significant amount of time re-inventing the wheel—and claiming that a new implementation is superior in every aspect—may not be the most productive approach. In my opinion, building improvements on top of an existing implementation could be a more efficient and constructive direction.

I sincerely apologize for providing incorrect test cases earlier. Due to time constraints, I did not have the opportunity to review them promptly, which unfortunately wasted your time. To address this, I have now provided a reproducible test case that demonstrates the kind of thread interleaving I mentioned.

#include <nuttx/config.h>
#include <stdio.h>

#include <nuttx/hrtimer.h>
#include <sys/resource.h>

/****************************************************************************
 * Name: hrtimer_process
 *
 * Description:
 *   Called from the timer interrupt handler to process expired
 *   high-resolution timers. If a timer has expired, its callback
 *   function will be executed in the context of the timer interrupt.
 *
 * Input Parameters:
 *   now - The current time (nsecs).
 *
 * Returned Value:
 *   None
 ****************************************************************************/

void hrtimer_process(uint64_t now);

/****************************************************************************
 * Inline Functions
 ****************************************************************************/

/****************************************************************************
 * Name: hrtimer_gettime
 *
 * Description:
 *   Get the current high-resolution time in nanoseconds.
 *
 * Returned Value:
 *   Current time in nanoseconds.
 ****************************************************************************/

static inline_function
uint64_t hrtimer_gettime(void)
{
  struct timespec ts;

  /* Get current time from platform-specific timer */

  clock_systime_timespec(&ts);

  /* Convert timespec to nanoseconds */

  return clock_time2nsec(&ts);
}

#define HRTIMER_TEST_THREAD_NR (4)

/****************************************************************************
 * Public Functions
 ****************************************************************************/

static volatile int count = 0;
static hrtimer_t    timer;
static spinlock_t   lock = SP_UNLOCKED;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)
{
  irqstate_t flags;
  up_ndelay(10 * NSEC_PER_USEC);
  flags = spin_lock_irqsave(&lock);
  count++;
  spin_unlock_irqrestore(&lock, flags);
  return 0;
}

static uint32_t test_callback_back(FAR struct hrtimer_s *hrtimer)
{
  return 0;
}

static void* test_thread(void *arg)
{
  irqstate_t  flags;
  int         tid   = (int)arg;
  int         ret;

  if (tid == 0)
    {
      hrtimer_init(&timer, test_callback, NULL);
    }

  while (1)
    {
      if (tid == 0)
        {
          int tmp_count;
          flags = spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          up_ndelay(1000);

          spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          ret = hrtimer_cancel_sync(&timer); // timer should be fully-cancelled here.
          tmp_count = count; // Read the count
          printf("%d\n", tmp_count);
          ASSERT(tmp_count == count); // This should not assert since timer has been cancelled.
        }
      else
        {
          flags = up_irq_save();
          hrtimer_process(hrtimer_gettime());
          up_irq_restore(flags);
        }
    }

  UNUSED(ret);

  return NULL;
}

/****************************************************************************
 * hello_main
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  unsigned int   thread_id;
  pthread_attr_t attr;
  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];

  printf("hrtimer_test start...\n");

  ASSERT(pthread_attr_init(&attr) == 0);

  /* Create wdog test thread */

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      ASSERT(pthread_create(&pthreads[thread_id], &attr,
                            test_thread, (void *)thread_id) == 0);
    }

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      pthread_join(pthreads[thread_id], NULL);
    }

  ASSERT(pthread_attr_destroy(&attr) == 0);

  printf("hrtimer_test end...\n");
  return 0;
}

In this scenario, a timer that should have been canceled continues to execute. This could lead to very subtle concurrency bugs (such as use-after-free behavior). If such issues make their way into the production environment, they could result in hundreds or even thousands of man-hours being spent on debugging, causing significant losses. Therefore, I find your claim that this is not a problem to be unconvincing.

I support @anchao’s viewpoint that we should design HRTimer from the user's perspective. In #17556, such issues are guaranteed not to occur. As I have mentioned before, there are fundamental flaws in the underlying state machine design of your hrtimer. Minor modifications cannot fix these issues. Therefore, I believe the most efficient approach is to replace your implementation with #17556. Honestly, your attempts, in my view, are good but futile. Because the root of the problem is the violation of ownership invariants. If you insist on finding a correct solution through trial and error, I can only respect your decision and wish you the best.

Regarding the accusation of “reinventing the wheel,” I must counter your argument—

  • Strive for excellence. Today, there are numerous embedded RTOS options available. If NuttX aims to attract more users, it must excel in every aspect and prove itself superior to other RTOSes. Even a 1% optimization in CPU time overhead or a few bytes of memory savings can benefit tens of millions of IoT devices running NuttX, saving substantial energy and hardware costs. In my opinion, every byte and every CPU cycle should be optimized as much as possible. I hope you share this mindset.
  • As educated engineers, we should strive to deliver our best work. However, pursuing excellence demands significant time and effort—if you are familiar with the Pareto principle, you’ll understand that achieving 80% often requires only 1% of the effort, but perfecting the remaining 20% demands 99% of the effort. Please do not assume that those who pursue this level of refinement are wasting their time. This is extremely disrespectful.

Furthermore, as @anchao emphasized, the community is not a testing ground. We should at least ensure basic functional correctness before merging code into the main branch. Please refrain from hastily integrating flawed code like this in the future.

Finally, over the coming weeks, I will focus on merging #17556. I do not have the time to repeatedly point out the fundamental issues in your underlying state machine design. I hope you understand. If you have the time, I would appreciate it if you could review my design for any functional correctness issues or provide suggestions for improvement. Thank you.

qemu-system-riscv32 -semihosting -M virt,aclint=on -cpu rv32 -smp 8 -bios none -nographic -kernel nuttx -s
ABC
NuttShell (NSH) NuttX-12.11.0
nsh>
nsh> hello
hrtimer_test start...
1
2
3
4
5
5
[CPU3] Current Version: NuttX  12.11.0 290d47b4da-dirty Dec 20 2025 12:23:59 risc-v
[CPU3] Assertion failed : at file: :0 task(CPU3): hello process: hello 0x80017690
[CPU3] EPC: 800021cc
[CPU3] A0: 8002b774 A1: 00000000 A2: 00000003 A3: 80033090
[CPU3] A4: 00000002 A5: 00000001 A6: 00000000 A7: 00000001
[CPU3] T0: 0000002e T1: 0000006a T2: 000001ff T3: 0000004c
[CPU3] T4: 00000068 T5: 00000009 T6: 0000002a
[CPU3] S0: 0000018c S1: 00000000 S2: 8002b2d0 S3: 80033090
[CPU3] S4: 00000000 S5: 00000000 S6: 8002d000 S7: 00002088
[CPU3] S8: 00000006 S9: 00000000 S10: 00000000 S11: 00000000
[CPU3] SP: 80033800 FP: 0000018c TP: 00000000 RA: 800021cc
[CPU3] User Stack:
[CPU3]   base: 0x80033178
[CPU3]   size: 00002024
[CPU3]     sp: 0x80033800
[CPU3] 0x800337e0: 8002d000 800227d4 800227d4 80033090 8002b2d0 8002b774 800324e8 8000228c
[CPU3] 0x80033800: 80017690 00000000 00000000 00000002 0000000a 800338ac 00000000 80008346
[CPU3] 0x80033820: 00000000 8001e424 80033090 8002b774 00000000 00000000 00000000 7474754e
[CPU3] 0x80033840: 00000058 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[CPU3] 0x80033860: 00000000 00000000 00000000 00000000 2e323100 302e3131 80025000 8002d000
[CPU3] 0x80033880: 8002d000 393257d4 37346430 61643462 7269642d 44207974 32206365 30322030
[CPU3] 0x800338a0: 31203532 33323a32 0039353a 00000002 8001cec0 8001ce6a 736972de 00762d63
[CPU3] 0x800338c0: 00002088 00000005 800257d4 8001c372 00000000 00000000 00000000 00000000
[CPU3] 0x800338e0: 00000000 8002d404 80025000 8002d000 8002d000 00000005 8002d000 80007bcc
[CPU3] 0x80033900: 00000001 00000000 ee6b2800 80017798 00000000 00000000 00000000 00000000
[CPU3] 0x80033920: 00000000 00000000 00000000 00000000 00000000 00000000 80033090 8001bf0c
[CPU3] 0x80033940: 00000000 00000000 00000000 8001d36a 00000000 00000000 00000000 00000000
[CPU3] 0x80033960: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
nsh>

You miss used hrtimer, the hrtime_process is not an external api to be used in user's task, it should be called from isr, i updated your test code a litttl bit, and the test succeeded:

#include <nuttx/config.h>
#include <stdio.h>

#include <nuttx/hrtimer.h>

#define HRTIMER_TEST_THREAD_NR (4)

/****************************************************************************
 * Public Functions
 ****************************************************************************/

static volatile int count = 0;
static hrtimer_t    timer;
static spinlock_t   lock = SP_UNLOCKED;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)
{
  irqstate_t flags;
  up_ndelay(10 * NSEC_PER_USEC);
  flags = spin_lock_irqsave(&lock);
  count++;
  spin_unlock_irqrestore(&lock, flags);
  return 0;
}

static uint32_t test_callback_back(FAR struct hrtimer_s *hrtimer)
{
  return 0;
}

static void* test_thread(void *arg)
{
  irqstate_t  flags;
  int         tid   = (int)arg;
  int         ret;

  if (tid == 0)
    {
      hrtimer_init(&timer, test_callback, NULL);
    }

  while (1)
    {
      if (tid == 0)
        {
          int tmp_count;
          flags = spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          up_ndelay(1000);

          spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          ret = hrtimer_cancel_sync(&timer); // timer should be fully-cancelled here.
          tmp_count = count; // Read the count
          printf("%d\n", tmp_count);
          ASSERT(tmp_count == count); // This should not assert since timer has been cancelled.
        }
      else
        {
          flags = up_irq_save();
          up_irq_restore(flags);
        }
    }

  UNUSED(ret);

  return NULL;
}

/****************************************************************************
 * hello_main
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  unsigned int   thread_id;
  pthread_attr_t attr;
  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];

  printf("hrtimer_test start...\n");

  ASSERT(pthread_attr_init(&attr) == 0);

  /* Create wdog test thread */

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      ASSERT(pthread_create(&pthreads[thread_id], &attr,
                            test_thread, (void *)thread_id) == 0);
    }

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      pthread_join(pthreads[thread_id], NULL);
    }

  ASSERT(pthread_attr_destroy(&attr) == 0);

  printf("hrtimer_test end...\n");
  return 0;
}

test log on rv-virt:smp64:

NuttShell (NSH)
nsh> 
nsh> uname -a
NuttX 0.0.0 9bcc6ae543-dirty Dec 20 2025 15:05:57 risc-v rv-virt
nsh> hello
(...)
11322
11324
11326
11328
11330
11332
11334
11336
11338
11340
11342
11344
11346
11348
11350
11352
11354
11356
11358
11360
(...)

@wangchdo wangchdo force-pushed the fix_hrtimer_bug_on_smp branch from 81e074a to 07fbd0b Compare December 20, 2025 07:13
@Fix-Point
Copy link
Contributor

Fix-Point commented Dec 20, 2025

Please pay more respect to parallel programming. It does not look like a simple change can fix this at all.

why this change is not simple? why fixing issue is not respecting you?

I hope that before you claim you fix the issue, you can review your state-machine more carefully. I must say sorry that I have not too much time to review your commit. I am not your z3 SMT solver who can always automatically generate the counter-examples for you. I think it is impossible (or very hard) to allow restart the timer correctly unless refactoring your state-machine design. At least, I can not find a solution that can ensure functional correctness for your design. That is why I recommend refactoring your implementation with #17556.
For example, When the newly-set timer (triggered at t2) while the old timer callback function (triggered at t1) is executing simultaneously. This is a highly likely SMP scenario of thread interleaving. The old timer callback returns next as UINT32_MAX, while the new request callback returns next as 1000. However, since the old timer callback detects that the hrtimer is in the running state, so it sets the next timer to (t2 + UINT32_MAX) instead of (t1 + UINT32_MAX). I suspect that the reason your patch fails to run ostest on rv-virt:smp is also due to this issue.

I already showed that ostest passed on rv-virt:smp64

Timeline:

CPU1 (Timer Callback)                     CPU2 (Set New Timer)

------|--------------------------------------|-------------------------

      |                                      |

      t1: Old timer triggers at t1           |

      |--- Callback starts                   |

      |   hrtimer->state <- running          | [Lock]

      | [Unlock]                             t2: New timer being start(t2)

      |                                      |--- hrtimer_start()

      |                                      |    hrtimer->state <- armed

      |   ...                                | [Unlock]

      |                                      | ...

      |   Callback executes...               | [Lock]

      |                                      |--- New timer triggered

      |                                      |    hrtimer->state <- running

      |                                      | [Unlock]

      |                                      |    Calllback executes....

      |                                      |

      |   Returns next = UINT32_MAX          |

      | [Lock]                               |

      | if (hrtimer->state == running)       |

      |   next expired                       | 

      |    = t2 + UINT32_MAX (wrong!)        | 

      |   hrtimer->state <- armed            | 

      | [Unlock]                             |

      |                                      |  [Lock]

      |                                      |--- hrtimer->state != running

      |                                      |    failed to set next (wrong!)

      |                                      |

------|--------------------------------------|-------------------------

I don’t think the case you pointed out is an issue, as allowing an hrtimer that has just been re-armed to be restarted again is purely a design choice. My initial design allowed re-arming an already armed hrtimer. After considering @xiaoxiang781216’s comments, I have revised the implementation to treat this case as an error.

By the way, here is the another simple test-case that fails that I do not even know why. It stucked at hrtimer_cancel_sync(&timer);. Please fix this problem first.
I must say that I am not a lab rat or your tester. Please ensure the functional correctness of your patches before merging it to upstream. I fully agree with @anchao, who pointed out my problem in #16342 before.

emm ... community is not a testing ground, please do not treating everyone like lab rats.

After that, I am trying my best to ensure my PR is functional correctness. I hope you too.

/****************************************************************************

 * apps/examples/hello/hello_main.c

 *

 * SPDX-License-Identifier: Apache-2.0

 *

 * Licensed to the Apache Software Foundation (ASF) under one or more

 * contributor license agreements.  See the NOTICE file distributed with

 * this work for additional information regarding copyright ownership.  The

 * ASF licenses this file to you under the Apache License, Version 2.0 (the

 * "License"); you may not use this file except in compliance with the

 * License.  You may obtain a copy of the License at

 *

 *   http://www.apache.org/licenses/LICENSE-2.0

 *

 * Unless required by applicable law or agreed to in writing, software

 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT

 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the

 * License for the specific language governing permissions and limitations

 * under the License.

 *

 ****************************************************************************/



/****************************************************************************

 * Included Files

 ****************************************************************************/



#include <nuttx/config.h>

#include <stdio.h>

#include <unistd.h>



#include <nuttx/hrtimer.h>



#define HRTIMER_TEST_THREAD_NR (1)

#define HRTIMER_TEST_NR        (1000000)



/****************************************************************************

 * Public Functions

 ****************************************************************************/



static int volatile tcount = 0;

static volatile uint32_t next = 0;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)

{

  tcount++;

  up_ndelay(hrtimer->expired % (10 * NSEC_PER_USEC));

  return 0;

}



static uint32_t test_callback_background(FAR struct hrtimer_s *hrtimer)

{

  up_ndelay(hrtimer->expired % NSEC_PER_USEC);

  return 0;

}





static void test1(int tid)

{

  hrtimer_t   timer;

  int         count = 0;

  irqstate_t flags;

  spinlock_t lock;



  if (tid == 0)

    {

      hrtimer_init(&timer, test_callback, NULL);

    }

  else

    {

      hrtimer_init(&timer, test_callback_background, NULL);

    }



  while (count++ < HRTIMER_TEST_NR)

    {

      int ret;

      if (tid == 0)

        {

          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);



          /* Simulate the periodical hrtimer.. */



          flags = spin_lock_irqsave(&lock);



          /* Use as periodical timer */



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);



          spin_unlock_irqrestore(&lock, flags);



          up_ndelay(NSEC_PER_MSEC);



          flags = spin_lock_irqsave(&lock);



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);

          spin_unlock_irqrestore(&lock, flags);



          up_ndelay(NSEC_PER_MSEC);



          hrtimer_cancel_sync(&timer); // stucked here????

          printf("???\n");

        }

      else

        {

          /* Simulate the background hrtimer.. */



          uint64_t delay = rand() % (10 * NSEC_PER_MSEC);



          ret = hrtimer_cancel(&timer);

          ret = hrtimer_start(&timer, delay, HRTIMER_MODE_REL);

        }



      UNUSED(ret);

    }

}



static void* test_thread(void *arg)

{

  while (1)

    {

      test1((int)arg);

    }

  return NULL;

}

/****************************************************************************

 * hello_main

 ****************************************************************************/



int main(int argc, FAR char *argv[])

{

  unsigned int   thread_id;

  pthread_attr_t attr;

  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];



  printf("hrtimer_test start...\n");



  ASSERT(pthread_attr_init(&attr) == 0);



  /* Create wdog test thread */



  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

    {

      ASSERT(pthread_create(&pthreads[thread_id], &attr,

                            test_thread, (void *)thread_id) == 0);

    }



  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

    {

      pthread_join(pthreads[thread_id], NULL);

    }



  ASSERT(pthread_attr_destroy(&attr) == 0);



  printf("hrtimer_test end...\n");

  return 0;

}

this is a minor issue, i just forgot to set hritmer to inactive when period is 0 according to the state machine I designe, it is fixed now:


case HRTIMER_STATE_RUNNING:
            {
              /* Timer callback completed normally */

              if (period > 0)
                {
                  hrtimer->expired += period;
                  hrtimer->state = HRTIMER_STATE_ARMED;
                  RB_INSERT(hrtimer_tree_s, &g_hrtimer_tree,
                            &hrtimer->node);
                }
              else
                {
                  hrtimer->state = HRTIMER_STATE_INACTIVE;
                }

              break;
            }

and your new test passed on rv-virt:smp64 when this update is added:

NuttShell (NSH)
nsh> 
nsh> 
nsh> uname -a
NuttX 0.0.0 02057dc356-dirty Dec 20 2025 00:21:54 risc-v rv-virt
nsh> 
nsh> hello

???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???
???

(...)

By the way, I have implemented hrtimer as an independent module, so even if it is not perfect, it does not impact existing NuttX functionality. I also want to emphasize that I respect you very much and have no intention of offending you. My concern is simply that spending a significant amount of time re-inventing the wheel—and claiming that a new implementation is superior in every aspect—may not be the most productive approach. In my opinion, building improvements on top of an existing implementation could be a more efficient and constructive direction.

I sincerely apologize for providing incorrect test cases earlier. Due to time constraints, I did not have the opportunity to review them promptly, which unfortunately wasted your time. To address this, I have now provided a reproducible test case that demonstrates the kind of thread interleaving I mentioned.

#include <nuttx/config.h>
#include <stdio.h>

#include <nuttx/hrtimer.h>
#include <sys/resource.h>

/****************************************************************************
 * Name: hrtimer_process
 *
 * Description:
 *   Called from the timer interrupt handler to process expired
 *   high-resolution timers. If a timer has expired, its callback
 *   function will be executed in the context of the timer interrupt.
 *
 * Input Parameters:
 *   now - The current time (nsecs).
 *
 * Returned Value:
 *   None
 ****************************************************************************/

void hrtimer_process(uint64_t now);

/****************************************************************************
 * Inline Functions
 ****************************************************************************/

/****************************************************************************
 * Name: hrtimer_gettime
 *
 * Description:
 *   Get the current high-resolution time in nanoseconds.
 *
 * Returned Value:
 *   Current time in nanoseconds.
 ****************************************************************************/

static inline_function
uint64_t hrtimer_gettime(void)
{
  struct timespec ts;

  /* Get current time from platform-specific timer */

  clock_systime_timespec(&ts);

  /* Convert timespec to nanoseconds */

  return clock_time2nsec(&ts);
}

#define HRTIMER_TEST_THREAD_NR (4)

/****************************************************************************
 * Public Functions
 ****************************************************************************/

static volatile int count = 0;
static hrtimer_t    timer;
static spinlock_t   lock = SP_UNLOCKED;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)
{
  irqstate_t flags;
  up_ndelay(10 * NSEC_PER_USEC);
  flags = spin_lock_irqsave(&lock);
  count++;
  spin_unlock_irqrestore(&lock, flags);
  return 0;
}

static uint32_t test_callback_back(FAR struct hrtimer_s *hrtimer)
{
  return 0;
}

static void* test_thread(void *arg)
{
  irqstate_t  flags;
  int         tid   = (int)arg;
  int         ret;

  if (tid == 0)
    {
      hrtimer_init(&timer, test_callback, NULL);
    }

  while (1)
    {
      if (tid == 0)
        {
          int tmp_count;
          flags = spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          up_ndelay(1000);

          spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          ret = hrtimer_cancel_sync(&timer); // timer should be fully-cancelled here.
          tmp_count = count; // Read the count
          printf("%d\n", tmp_count);
          ASSERT(tmp_count == count); // This should not assert since timer has been cancelled.
        }
      else
        {
          flags = up_irq_save();
          hrtimer_process(hrtimer_gettime());
          up_irq_restore(flags);
        }
    }

  UNUSED(ret);

  return NULL;
}

/****************************************************************************
 * hello_main
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  unsigned int   thread_id;
  pthread_attr_t attr;
  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];

  printf("hrtimer_test start...\n");

  ASSERT(pthread_attr_init(&attr) == 0);

  /* Create wdog test thread */

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      ASSERT(pthread_create(&pthreads[thread_id], &attr,
                            test_thread, (void *)thread_id) == 0);
    }

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      pthread_join(pthreads[thread_id], NULL);
    }

  ASSERT(pthread_attr_destroy(&attr) == 0);

  printf("hrtimer_test end...\n");
  return 0;
}

In this scenario, a timer that should have been canceled continues to execute. This could lead to very subtle concurrency bugs (such as use-after-free behavior). If such issues make their way into the production environment, they could result in hundreds or even thousands of man-hours being spent on debugging, causing significant losses. Therefore, I find your claim that this is not a problem to be unconvincing.
I support @anchao’s viewpoint that we should design HRTimer from the user's perspective. In #17556, such issues are guaranteed not to occur. As I have mentioned before, there are fundamental flaws in the underlying state machine design of your hrtimer. Minor modifications cannot fix these issues. Therefore, I believe the most efficient approach is to replace your implementation with #17556. Honestly, your attempts, in my view, are good but futile. Because the root of the problem is the violation of ownership invariants. If you insist on finding a correct solution through trial and error, I can only respect your decision and wish you the best.
Regarding the accusation of “reinventing the wheel,” I must counter your argument—

  • Strive for excellence. Today, there are numerous embedded RTOS options available. If NuttX aims to attract more users, it must excel in every aspect and prove itself superior to other RTOSes. Even a 1% optimization in CPU time overhead or a few bytes of memory savings can benefit tens of millions of IoT devices running NuttX, saving substantial energy and hardware costs. In my opinion, every byte and every CPU cycle should be optimized as much as possible. I hope you share this mindset.
  • As educated engineers, we should strive to deliver our best work. However, pursuing excellence demands significant time and effort—if you are familiar with the Pareto principle, you’ll understand that achieving 80% often requires only 1% of the effort, but perfecting the remaining 20% demands 99% of the effort. Please do not assume that those who pursue this level of refinement are wasting their time. This is extremely disrespectful.

Furthermore, as @anchao emphasized, the community is not a testing ground. We should at least ensure basic functional correctness before merging code into the main branch. Please refrain from hastily integrating flawed code like this in the future.
Finally, over the coming weeks, I will focus on merging #17556. I do not have the time to repeatedly point out the fundamental issues in your underlying state machine design. I hope you understand. If you have the time, I would appreciate it if you could review my design for any functional correctness issues or provide suggestions for improvement. Thank you.

qemu-system-riscv32 -semihosting -M virt,aclint=on -cpu rv32 -smp 8 -bios none -nographic -kernel nuttx -s
ABC
NuttShell (NSH) NuttX-12.11.0
nsh>
nsh> hello
hrtimer_test start...
1
2
3
4
5
5
[CPU3] Current Version: NuttX  12.11.0 290d47b4da-dirty Dec 20 2025 12:23:59 risc-v
[CPU3] Assertion failed : at file: :0 task(CPU3): hello process: hello 0x80017690
[CPU3] EPC: 800021cc
[CPU3] A0: 8002b774 A1: 00000000 A2: 00000003 A3: 80033090
[CPU3] A4: 00000002 A5: 00000001 A6: 00000000 A7: 00000001
[CPU3] T0: 0000002e T1: 0000006a T2: 000001ff T3: 0000004c
[CPU3] T4: 00000068 T5: 00000009 T6: 0000002a
[CPU3] S0: 0000018c S1: 00000000 S2: 8002b2d0 S3: 80033090
[CPU3] S4: 00000000 S5: 00000000 S6: 8002d000 S7: 00002088
[CPU3] S8: 00000006 S9: 00000000 S10: 00000000 S11: 00000000
[CPU3] SP: 80033800 FP: 0000018c TP: 00000000 RA: 800021cc
[CPU3] User Stack:
[CPU3]   base: 0x80033178
[CPU3]   size: 00002024
[CPU3]     sp: 0x80033800
[CPU3] 0x800337e0: 8002d000 800227d4 800227d4 80033090 8002b2d0 8002b774 800324e8 8000228c
[CPU3] 0x80033800: 80017690 00000000 00000000 00000002 0000000a 800338ac 00000000 80008346
[CPU3] 0x80033820: 00000000 8001e424 80033090 8002b774 00000000 00000000 00000000 7474754e
[CPU3] 0x80033840: 00000058 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[CPU3] 0x80033860: 00000000 00000000 00000000 00000000 2e323100 302e3131 80025000 8002d000
[CPU3] 0x80033880: 8002d000 393257d4 37346430 61643462 7269642d 44207974 32206365 30322030
[CPU3] 0x800338a0: 31203532 33323a32 0039353a 00000002 8001cec0 8001ce6a 736972de 00762d63
[CPU3] 0x800338c0: 00002088 00000005 800257d4 8001c372 00000000 00000000 00000000 00000000
[CPU3] 0x800338e0: 00000000 8002d404 80025000 8002d000 8002d000 00000005 8002d000 80007bcc
[CPU3] 0x80033900: 00000001 00000000 ee6b2800 80017798 00000000 00000000 00000000 00000000
[CPU3] 0x80033920: 00000000 00000000 00000000 00000000 00000000 00000000 80033090 8001bf0c
[CPU3] 0x80033940: 00000000 00000000 00000000 8001d36a 00000000 00000000 00000000 00000000
[CPU3] 0x80033960: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
nsh>

You miss used hrtimer, the hrtime_process is not an external api to be used in user's task, it should be called from isr, i updated your test code a litttl bit, and the test succeeded:

#include <nuttx/config.h>
#include <stdio.h>

#include <nuttx/hrtimer.h>

#define HRTIMER_TEST_THREAD_NR (4)

/****************************************************************************
 * Public Functions
 ****************************************************************************/

static volatile int count = 0;
static hrtimer_t    timer;
static spinlock_t   lock = SP_UNLOCKED;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)
{
  irqstate_t flags;
  up_ndelay(10 * NSEC_PER_USEC);
  flags = spin_lock_irqsave(&lock);
  count++;
  spin_unlock_irqrestore(&lock, flags);
  return 0;
}

static uint32_t test_callback_back(FAR struct hrtimer_s *hrtimer)
{
  return 0;
}

static void* test_thread(void *arg)
{
  irqstate_t  flags;
  int         tid   = (int)arg;
  int         ret;

  if (tid == 0)
    {
      hrtimer_init(&timer, test_callback, NULL);
    }

  while (1)
    {
      if (tid == 0)
        {
          int tmp_count;
          flags = spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          up_ndelay(1000);

          spin_lock_irqsave(&lock);
          ret = hrtimer_cancel(&timer);
          ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
          spin_unlock_irqrestore(&lock, flags);

          ret = hrtimer_cancel_sync(&timer); // timer should be fully-cancelled here.
          tmp_count = count; // Read the count
          printf("%d\n", tmp_count);
          ASSERT(tmp_count == count); // This should not assert since timer has been cancelled.
        }
      else
        {
          flags = up_irq_save();
          up_irq_restore(flags);
        }
    }

  UNUSED(ret);

  return NULL;
}

/****************************************************************************
 * hello_main
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  unsigned int   thread_id;
  pthread_attr_t attr;
  pthread_t      pthreads[HRTIMER_TEST_THREAD_NR];

  printf("hrtimer_test start...\n");

  ASSERT(pthread_attr_init(&attr) == 0);

  /* Create wdog test thread */

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      ASSERT(pthread_create(&pthreads[thread_id], &attr,
                            test_thread, (void *)thread_id) == 0);
    }

  for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)
    {
      pthread_join(pthreads[thread_id], NULL);
    }

  ASSERT(pthread_attr_destroy(&attr) == 0);

  printf("hrtimer_test end...\n");
  return 0;
}

test log on rv-virt:smp64:

NuttShell (NSH)
nsh> 
nsh> uname -a
NuttX 0.0.0 9bcc6ae543-dirty Dec 20 2025 15:05:57 risc-v rv-virt
nsh> hello
(...)
11322
11324
11326
11328
11330
11332
11334
11336
11338
11340
11342
11344
11346
11348
11350
11352
11354
11356
11358
11360
(...)

I am afraid that this does not appear to be an issue caused by hrtimer_process(hrtimer_gettime());. The call is only intended to make it easier and faster to reproduce the thread interleaving scenario I described earlier.
When you remove the ASSERT on line 108, the program executes normally, which indicates that calling hrtimer_process here is OK.
This test case is designed to demonstrate that the following interleaving scenario can occur during actual concurrency expiration handling:

  1. Core 1 is executing the old timer callback function.
  2. Core 0 restarts a new timer.
  3. Core 2 executes timer expiration, and it begins executing the new timer callback function.
  4. Core 1 finishes its old callback execution and sets hrtimer->state to INACTIVE.
  5. Core 0 attempts to cancel the timer, finds hrtimer->state == INACTIVE, and assumes the timer has completed. However, Core 2 is still executing the new timer callback, which may lead to a use-after-free issue.

The root-cause is the violation of the ownership invariants, which allows two different owners modify the same timer. In my opinion, this problem cannot be solved unless the entire state machine is redesigned. That is the one of the reason why I recommend the functional correct implementation #17556.

Please don’t take this the wrong way—I truly respect the time and effort you’ve invested in the design and implementation of this HRtimer, and I also greatly appreciate some of the changes you've made to the scheduler.
The reason I suggest adopting #17556 implementation is simply to save us both time and allow us to focus on more meaningful optimizations. I hope you could understand.

The issue you're facing now is one I've encountered before. To be honest, I don't mean to see someone stumble over the same stone again. My solution was to carefully design the state machine with a clear ownership transfer path—exactly as you see in the #17556.

Non-blocking Timer Restart: To address the issue in Linux where restarting a timer must wait for an already-started callback function to finish, which reduces the real-time performance, the new HRTimer implements a non-blocking timer restart mechanism. This mechanism reuses the last bit of the hazard pointer to mark whether the thread executing the callback has lost write ownership of the HRTimer object. After hrtimer_async_cancel is called, other threads executing callbacks will lose write ownership of the HRTimer (though their callback functions may still be executing). This means the HRTimer can be restarted and repurposed for other callbacks without waiting for the callback function to complete. However, note that the callback function might still be executing, requiring users to consider this concurrency and implement proper synchronization mechanisms within their callback functions. To explicitly remind users of this concurrency, an HRTimer whose callback function has not yet completed execution must be restarted using hrtimer_restart. This function relaxes the state checks on the HRTimer, allowing a timer with the callback running to be started.

Since there is already a decent solution to this problem, why don't we relax and enjoy the weekend instead of spending more time stuck on the same problem?

@wangchdo
Copy link
Contributor Author

wangchdo commented Dec 20, 2025

Please pay more respect to parallel programming. It does not look like a simple change can fix this at all.

why this change is not simple? why fixing issue is not respecting you?

I hope that before you claim you fix the issue, you can review your state-machine more carefully. I must say sorry that I have not too much time to review your commit. I am not your z3 SMT solver who can always automatically generate the counter-examples for you. I think it is impossible (or very hard) to allow restart the timer correctly unless refactoring your state-machine design. At least, I can not find a solution that can ensure functional correctness for your design. That is why I recommend refactoring your implementation with #17556.

For example, When the newly-set timer (triggered at t2) while the old timer callback function (triggered at t1) is executing simultaneously. This is a highly likely SMP scenario of thread interleaving. The old timer callback returns next as UINT32_MAX, while the new request callback returns next as 1000. However, since the old timer callback detects that the hrtimer is in the running state, so it sets the next timer to (t2 + UINT32_MAX) instead of (t1 + UINT32_MAX). I suspect that the reason your patch fails to run ostest on rv-virt:smp is also due to this issue.

I already showed that ostest passed on rv-virt:smp64

Timeline:

CPU1 (Timer Callback) CPU2 (Set New Timer)

------|--------------------------------------|-------------------------

  |                                      |
  t1: Old timer triggers at t1           |
  |--- Callback starts                   |
  |   hrtimer->state <- running          | [Lock]
  | [Unlock]                             t2: New timer being start(t2)
  |                                      |--- hrtimer_start()
  |                                      |    hrtimer->state <- armed
  |   ...                                | [Unlock]
  |                                      | ...
  |   Callback executes...               | [Lock]
  |                                      |--- New timer triggered
  |                                      |    hrtimer->state <- running
  |                                      | [Unlock]
  |                                      |    Calllback executes....
  |                                      |
  |   Returns next = UINT32_MAX          |
  | [Lock]                               |
  | if (hrtimer->state == running)       |
  |   next expired                       | 
  |    = t2 + UINT32_MAX (wrong!)        | 
  |   hrtimer->state <- armed            | 
  | [Unlock]                             |
  |                                      |  [Lock]
  |                                      |--- hrtimer->state != running
  |                                      |    failed to set next (wrong!)
  |                                      |

------|--------------------------------------|-------------------------

I don’t think the case you pointed out is an issue, as allowing an hrtimer that has just been re-armed to be restarted again is purely a design choice. My initial design allowed re-arming an already armed hrtimer. After considering @xiaoxiang781216’s comments, I have revised the implementation to treat this case as an error.

By the way, here is the another simple test-case that fails that I do not even know why. It stucked at hrtimer_cancel_sync(&timer);. Please fix this problem first.

I must say that I am not a lab rat or your tester. Please ensure the functional correctness of your patches before merging it to upstream. I fully agree with @anchao, who pointed out my problem in #16342 before.

emm ... community is not a testing ground, please do not treating everyone like lab rats.

After that, I am trying my best to ensure my PR is functional correctness. I hope you too.

/****************************************************************************

  • apps/examples/hello/hello_main.c
  • SPDX-License-Identifier: Apache-2.0
  • Licensed to the Apache Software Foundation (ASF) under one or more
  • contributor license agreements. See the NOTICE file distributed with
  • this work for additional information regarding copyright ownership. The
  • ASF licenses this file to you under the Apache License, Version 2.0 (the
  • "License"); you may not use this file except in compliance with the
  • License. You may obtain a copy of the License at
  • Unless required by applicable law or agreed to in writing, software
  • distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
  • WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
  • License for the specific language governing permissions and limitations
  • under the License.

****************************************************************************/

/****************************************************************************

  • Included Files

****************************************************************************/

#include <nuttx/config.h>

#include <stdio.h>

#include <unistd.h>

#include <nuttx/hrtimer.h>

#define HRTIMER_TEST_THREAD_NR (1)

#define HRTIMER_TEST_NR (1000000)

/****************************************************************************

  • Public Functions

****************************************************************************/

static int volatile tcount = 0;

static volatile uint32_t next = 0;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)

{

tcount++;

up_ndelay(hrtimer->expired % (10 * NSEC_PER_USEC));

return 0;

}

static uint32_t test_callback_background(FAR struct hrtimer_s *hrtimer)

{

up_ndelay(hrtimer->expired % NSEC_PER_USEC);

return 0;

}

static void test1(int tid)

{

hrtimer_t timer;

int count = 0;

irqstate_t flags;

spinlock_t lock;

if (tid == 0)

{
  hrtimer_init(&timer, test_callback, NULL);
}

else

{
  hrtimer_init(&timer, test_callback_background, NULL);
}

while (count++ < HRTIMER_TEST_NR)

{
  int ret;
  if (tid == 0)
    {
      uint64_t delay = rand() % (10 * NSEC_PER_MSEC);
      /* Simulate the periodical hrtimer.. */
      flags = spin_lock_irqsave(&lock);
      /* Use as periodical timer */
      ret = hrtimer_cancel(&timer);
      ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);
      spin_unlock_irqrestore(&lock, flags);
      up_ndelay(NSEC_PER_MSEC);
      flags = spin_lock_irqsave(&lock);
      ret = hrtimer_cancel(&timer);
      ret = hrtimer_start(&timer, 1000, HRTIMER_MODE_REL);
      spin_unlock_irqrestore(&lock, flags);
      up_ndelay(NSEC_PER_MSEC);
      hrtimer_cancel_sync(&timer); // stucked here????
      printf("???\n");
    }
  else
    {
      /* Simulate the background hrtimer.. */
      uint64_t delay = rand() % (10 * NSEC_PER_MSEC);
      ret = hrtimer_cancel(&timer);
      ret = hrtimer_start(&timer, delay, HRTIMER_MODE_REL);
    }
  UNUSED(ret);
}

}

static void* test_thread(void *arg)

{

while (1)

{
  test1((int)arg);
}

return NULL;

}

/****************************************************************************

  • hello_main

****************************************************************************/

int main(int argc, FAR char *argv[])

{

unsigned int thread_id;

pthread_attr_t attr;

pthread_t pthreads[HRTIMER_TEST_THREAD_NR];

printf("hrtimer_test start...\n");

ASSERT(pthread_attr_init(&attr) == 0);

/* Create wdog test thread */

for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

{
  ASSERT(pthread_create(&pthreads[thread_id], &attr,
                        test_thread, (void *)thread_id) == 0);
}

for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

{
  pthread_join(pthreads[thread_id], NULL);
}

ASSERT(pthread_attr_destroy(&attr) == 0);

printf("hrtimer_test end...\n");

return 0;

}

this is a minor issue, i just forgot to set hritmer to inactive when period is 0 according to the state machine I designe, it is fixed now:

case HRTIMER_STATE_RUNNING:

        {
          /* Timer callback completed normally */
          if (period > 0)
            {
              hrtimer->expired += period;
              hrtimer->state = HRTIMER_STATE_ARMED;
              RB_INSERT(hrtimer_tree_s, &g_hrtimer_tree,
                        &hrtimer->node);
            }
          else
            {
              hrtimer->state = HRTIMER_STATE_INACTIVE;
            }
          break;
        }

and your new test passed on rv-virt:smp64 when this update is added:

NuttShell (NSH)

nsh>

nsh>

nsh> uname -a

NuttX 0.0.0 02057dc356-dirty Dec 20 2025 00:21:54 risc-v rv-virt

nsh>

nsh> hello

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

???

(...)

By the way, I have implemented hrtimer as an independent module, so even if it is not perfect, it does not impact existing NuttX functionality. I also want to emphasize that I respect you very much and have no intention of offending you. My concern is simply that spending a significant amount of time re-inventing the wheel—and claiming that a new implementation is superior in every aspect—may not be the most productive approach. In my opinion, building improvements on top of an existing implementation could be a more efficient and constructive direction.

I sincerely apologize for providing incorrect test cases earlier. Due to time constraints, I did not have the opportunity to review them promptly, which unfortunately wasted your time. To address this, I have now provided a reproducible test case that demonstrates the kind of thread interleaving I mentioned.

#include <nuttx/config.h>

#include <stdio.h>

#include <nuttx/hrtimer.h>

#include <sys/resource.h>

/****************************************************************************

  • Name: hrtimer_process
  • Description:
  • Called from the timer interrupt handler to process expired
  • high-resolution timers. If a timer has expired, its callback
  • function will be executed in the context of the timer interrupt.
  • Input Parameters:
  • now - The current time (nsecs).
  • Returned Value:
  • None

****************************************************************************/

void hrtimer_process(uint64_t now);

/****************************************************************************

  • Inline Functions

****************************************************************************/

/****************************************************************************

  • Name: hrtimer_gettime
  • Description:
  • Get the current high-resolution time in nanoseconds.
  • Returned Value:
  • Current time in nanoseconds.

****************************************************************************/

static inline_function

uint64_t hrtimer_gettime(void)

{

struct timespec ts;

/* Get current time from platform-specific timer */

clock_systime_timespec(&ts);

/* Convert timespec to nanoseconds */

return clock_time2nsec(&ts);

}

#define HRTIMER_TEST_THREAD_NR (4)

/****************************************************************************

  • Public Functions

****************************************************************************/

static volatile int count = 0;

static hrtimer_t timer;

static spinlock_t lock = SP_UNLOCKED;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)

{

irqstate_t flags;

up_ndelay(10 * NSEC_PER_USEC);

flags = spin_lock_irqsave(&lock);

count++;

spin_unlock_irqrestore(&lock, flags);

return 0;

}

static uint32_t test_callback_back(FAR struct hrtimer_s *hrtimer)

{

return 0;

}

static void* test_thread(void *arg)

{

irqstate_t flags;

int tid = (int)arg;

int ret;

if (tid == 0)

{
  hrtimer_init(&timer, test_callback, NULL);
}

while (1)

{
  if (tid == 0)
    {
      int tmp_count;
      flags = spin_lock_irqsave(&lock);
      ret = hrtimer_cancel(&timer);
      ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
      spin_unlock_irqrestore(&lock, flags);
      up_ndelay(1000);
      spin_lock_irqsave(&lock);
      ret = hrtimer_cancel(&timer);
      ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
      spin_unlock_irqrestore(&lock, flags);
      ret = hrtimer_cancel_sync(&timer); // timer should be fully-cancelled here.
      tmp_count = count; // Read the count
      printf("%d\n", tmp_count);
      ASSERT(tmp_count == count); // This should not assert since timer has been cancelled.
    }
  else
    {
      flags = up_irq_save();
      hrtimer_process(hrtimer_gettime());
      up_irq_restore(flags);
    }
}

UNUSED(ret);

return NULL;

}

/****************************************************************************

  • hello_main

****************************************************************************/

int main(int argc, FAR char *argv[])

{

unsigned int thread_id;

pthread_attr_t attr;

pthread_t pthreads[HRTIMER_TEST_THREAD_NR];

printf("hrtimer_test start...\n");

ASSERT(pthread_attr_init(&attr) == 0);

/* Create wdog test thread */

for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

{
  ASSERT(pthread_create(&pthreads[thread_id], &attr,
                        test_thread, (void *)thread_id) == 0);
}

for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

{
  pthread_join(pthreads[thread_id], NULL);
}

ASSERT(pthread_attr_destroy(&attr) == 0);

printf("hrtimer_test end...\n");

return 0;

}

In this scenario, a timer that should have been canceled continues to execute. This could lead to very subtle concurrency bugs (such as use-after-free behavior). If such issues make their way into the production environment, they could result in hundreds or even thousands of man-hours being spent on debugging, causing significant losses. Therefore, I find your claim that this is not a problem to be unconvincing.

I support @anchao’s viewpoint that we should design HRTimer from the user's perspective. In #17556, such issues are guaranteed not to occur. As I have mentioned before, there are fundamental flaws in the underlying state machine design of your hrtimer. Minor modifications cannot fix these issues. Therefore, I believe the most efficient approach is to replace your implementation with #17556. Honestly, your attempts, in my view, are good but futile. Because the root of the problem is the violation of ownership invariants. If you insist on finding a correct solution through trial and error, I can only respect your decision and wish you the best.

Regarding the accusation of “reinventing the wheel,” I must counter your argument—

  • Strive for excellence. Today, there are numerous embedded RTOS options available. If NuttX aims to attract more users, it must excel in every aspect and prove itself superior to other RTOSes. Even a 1% optimization in CPU time overhead or a few bytes of memory savings can benefit tens of millions of IoT devices running NuttX, saving substantial energy and hardware costs. In my opinion, every byte and every CPU cycle should be optimized as much as possible. I hope you share this mindset.
  • As educated engineers, we should strive to deliver our best work. However, pursuing excellence demands significant time and effort—if you are familiar with the Pareto principle, you’ll understand that achieving 80% often requires only 1% of the effort, but perfecting the remaining 20% demands 99% of the effort. Please do not assume that those who pursue this level of refinement are wasting their time. This is extremely disrespectful.

Furthermore, as @anchao emphasized, the community is not a testing ground. We should at least ensure basic functional correctness before merging code into the main branch. Please refrain from hastily integrating flawed code like this in the future.

Finally, over the coming weeks, I will focus on merging #17556. I do not have the time to repeatedly point out the fundamental issues in your underlying state machine design. I hope you understand. If you have the time, I would appreciate it if you could review my design for any functional correctness issues or provide suggestions for improvement. Thank you.

qemu-system-riscv32 -semihosting -M virt,aclint=on -cpu rv32 -smp 8 -bios none -nographic -kernel nuttx -s

ABC

NuttShell (NSH) NuttX-12.11.0

nsh>

nsh> hello

hrtimer_test start...

1

2

3

4

5

5

[CPU3] Current Version: NuttX 12.11.0 290d47b4da-dirty Dec 20 2025 12:23:59 risc-v

[CPU3] Assertion failed : at file: :0 task(CPU3): hello process: hello 0x80017690

[CPU3] EPC: 800021cc

[CPU3] A0: 8002b774 A1: 00000000 A2: 00000003 A3: 80033090

[CPU3] A4: 00000002 A5: 00000001 A6: 00000000 A7: 00000001

[CPU3] T0: 0000002e T1: 0000006a T2: 000001ff T3: 0000004c

[CPU3] T4: 00000068 T5: 00000009 T6: 0000002a

[CPU3] S0: 0000018c S1: 00000000 S2: 8002b2d0 S3: 80033090

[CPU3] S4: 00000000 S5: 00000000 S6: 8002d000 S7: 00002088

[CPU3] S8: 00000006 S9: 00000000 S10: 00000000 S11: 00000000

[CPU3] SP: 80033800 FP: 0000018c TP: 00000000 RA: 800021cc

[CPU3] User Stack:

[CPU3] base: 0x80033178

[CPU3] size: 00002024

[CPU3] sp: 0x80033800

[CPU3] 0x800337e0: 8002d000 800227d 800227d 80033090 8002b2d0 8002b774 800324e8 8000228c

[CPU3] 0x80033800: 80017690 00000000 00000000 00000002 0000000a 800338ac 00000000 80008346

[CPU3] 0x80033820: 00000000 8001e424 80033090 8002b774 00000000 00000000 00000000 7474754e

[CPU3] 0x80033840: 00000058 00000000 00000000 00000000 00000000 00000000 00000000 00000000

[CPU3] 0x80033860: 00000000 00000000 00000000 00000000 2e323100 302e3131 80025000 8002d000

[CPU3] 0x80033880: 8002d000 393257d4 37346430 61643462 7269642d 44207974 32206365 30322030

[CPU3] 0x800338a0: 31203532 33323a32 0039353a 00000002 8001cec0 8001ce6a 736972de 00762d63

[CPU3] 0x800338c0: 00002088 00000005 800257d4 8001c372 00000000 00000000 00000000 00000000

[CPU3] 0x800338e0: 00000000 8002d404 80025000 8002d000 8002d000 00000005 8002d000 80007bcc

[CPU3] 0x80033900: 00000001 00000000 ee6b2800 80017798 00000000 00000000 00000000 00000000

[CPU3] 0x80033920: 00000000 00000000 00000000 00000000 00000000 00000000 80033090 8001bf0c

[CPU3] 0x80033940: 00000000 00000000 00000000 8001d36a 00000000 00000000 00000000 00000000

[CPU3] 0x80033960: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

nsh>

You miss used hrtimer, the hrtime_process is not an external api to be used in user's task, it should be called from isr, i updated your test code a litttl bit, and the test succeeded:

#include <nuttx/config.h>

#include <stdio.h>

#include <nuttx/hrtimer.h>

#define HRTIMER_TEST_THREAD_NR (4)

/****************************************************************************

  • Public Functions

****************************************************************************/

static volatile int count = 0;

static hrtimer_t timer;

static spinlock_t lock = SP_UNLOCKED;

static uint32_t test_callback(FAR struct hrtimer_s *hrtimer)

{

irqstate_t flags;

up_ndelay(10 * NSEC_PER_USEC);

flags = spin_lock_irqsave(&lock);

count++;

spin_unlock_irqrestore(&lock, flags);

return 0;

}

static uint32_t test_callback_back(FAR struct hrtimer_s *hrtimer)

{

return 0;

}

static void* test_thread(void *arg)

{

irqstate_t flags;

int tid = (int)arg;

int ret;

if (tid == 0)

{
  hrtimer_init(&timer, test_callback, NULL);
}

while (1)

{
  if (tid == 0)
    {
      int tmp_count;
      flags = spin_lock_irqsave(&lock);
      ret = hrtimer_cancel(&timer);
      ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
      spin_unlock_irqrestore(&lock, flags);
      up_ndelay(1000);
      spin_lock_irqsave(&lock);
      ret = hrtimer_cancel(&timer);
      ret = hrtimer_start(&timer, 0, HRTIMER_MODE_REL);
      spin_unlock_irqrestore(&lock, flags);
      ret = hrtimer_cancel_sync(&timer); // timer should be fully-cancelled here.
      tmp_count = count; // Read the count
      printf("%d\n", tmp_count);
      ASSERT(tmp_count == count); // This should not assert since timer has been cancelled.
    }
  else
    {
      flags = up_irq_save();
      up_irq_restore(flags);
    }
}

UNUSED(ret);

return NULL;

}

/****************************************************************************

  • hello_main

****************************************************************************/

int main(int argc, FAR char *argv[])

{

unsigned int thread_id;

pthread_attr_t attr;

pthread_t pthreads[HRTIMER_TEST_THREAD_NR];

printf("hrtimer_test start...\n");

ASSERT(pthread_attr_init(&attr) == 0);

/* Create wdog test thread */

for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

{
  ASSERT(pthread_create(&pthreads[thread_id], &attr,
                        test_thread, (void *)thread_id) == 0);
}

for (thread_id = 0; thread_id < HRTIMER_TEST_THREAD_NR; thread_id++)

{
  pthread_join(pthreads[thread_id], NULL);
}

ASSERT(pthread_attr_destroy(&attr) == 0);

printf("hrtimer_test end...\n");

return 0;

}

test log on rv-virt:smp64:

NuttShell (NSH)

nsh>

nsh> uname -a

NuttX 0.0.0 9bcc6ae-dirty Dec 20 2025 15:05:57 risc-v rv-virt

nsh> hello

(...)

11322

11324

11326

11328

11330

11332

11334

11336

11338

11340

11342

11344

11346

11348

11350

11352

11354

11356

11358

11360

(...)

I am afraid that this does not appear to be an issue caused by hrtimer_process(hrtimer_gettime());. The call is only intended to make it easier and faster to reproduce the thread interleaving scenario I described earlier.

When you remove the ASSERT on line 108, the program executes normally, which indicates that calling hrtimer_process here is OK.

This test case is designed to demonstrate that the following interleaving scenario can occur during actual concurrency expiration handling:

  1. Core 1 is executing the old timer callback function.

  2. Core 0 restarts a new timer.

  3. Core 2 executes timer expiration, and it begins executing the new timer callback function.

  4. Core 1 finishes its old callback execution and sets hrtimer->state to INACTIVE.

  5. Core 0 attempts to cancel the timer, finds hrtimer->state == INACTIVE, and assumes the timer has completed. However, Core 2 is still executing the new timer callback, which may lead to a use-after-free issue.

The root-cause is the violation of the ownership invariants, which allows two different owners modify the same timer. In my opinion, this problem cannot be solved unless the entire state machine is redesigned. That is the one of the reason why I recommend the functional correct implementation #17556.

Please don’t take this the wrong way—I truly respect the time and effort you’ve invested in the design and implementation of this HRtimer, and I also greatly appreciate some of the changes you've made to the scheduler.

The reason I suggest adopting #17556 implementation is simply to save us both time and allow us to focus on more meaningful optimizations. I hope you could understand.

The issue you're facing now is one I've encountered before. To be honest, I don't mean to see someone stumble over the same stone again. My solution was to carefully design the state machine with a clear ownership transfer path—exactly as you see in the #17556.

Non-blocking Timer Restart: To address the issue in Linux where restarting a timer must wait for an already-started callback function to finish, which reduces the real-time performance, the new HRTimer implements a non-blocking timer restart mechanism. This mechanism reuses the last bit of the hazard pointer to mark whether the thread executing the callback has lost write ownership of the HRTimer object. After hrtimer_async_cancel is called, other threads executing callbacks will lose write ownership of the HRTimer (though their callback functions may still be executing). This means the HRTimer can be restarted and repurposed for other callbacks without waiting for the callback function to complete. However, note that the callback function might still be executing, requiring users to consider this concurrency and implement proper synchronization mechanisms within their callback functions. To explicitly remind users of this concurrency, an HRTimer whose callback function has not yet completed execution must be restarted using hrtimer_restart. This function relaxes the state checks on the HRTimer, allowing a timer with the callback running to be started.

Since there is already a decent solution to this problem, why don't we relax and enjoy the weekend instead of spending more time stuck on the same problem?

Do you check my new update of the state machine? I don't think the case you mentioned is an issue in current update.

I think SMP case can be solved by hrtimer state-machine update, and state-machine update is not major, since it is only part of the hrtimer module. And the hrtimer acting as a independent module, it allows being improved independently for cases that were not considered at the start.

Allow running/armed hrtimer to be restarted to
fix hrtimer bug: apache#17567

Signed-off-by: Chengdong Wang <wangchengdong@lixiang.com>
  Update the hrtimer documentation to describe the hrtimer state machine,
  which is introduced to handle safe cancellation and execution in SMP
  environments.

Signed-off-by: Chengdong Wang <wangchengdong@lixiang.com>
@wangchdo wangchdo force-pushed the fix_hrtimer_bug_on_smp branch from 07fbd0b to b5a9df7 Compare December 20, 2025 11:21
@github-actions github-actions bot added the Size: M The size of the change in this PR is medium label Dec 20, 2025
@wangchdo
Copy link
Contributor Author

This PR has been incorporated into #17573.
I am closing this one so reviewers can focus their efforts on that PR instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Documentation Improvements or additions to documentation Area: OS Components OS Components issues Size: M The size of the change in this PR is medium Size: S The size of the change in this PR is small

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants