How Modern Programming Languages Implement Threading, and How I Did It


Shrehan Raj Singh
Dec 24, 2025

Threads are among the hardest topics to master, both in terms of usage and implementation. Writing thread-safe code has always been tedious and hard to debug, and languages that get thread management right, like Go with its goroutines, are praised for it. I too had my fair share of nightmares trying to implement threads. A few of the problems I encountered were:

  • Tracking down undefined behavior, like random segmentation faults.
  • Allocating too much memory, causing memory leaks.
  • Freeing too early, creating dangling pointers.
  • Many more I would rather not share out of embarrassment.

What are threads? Threads are small units of execution that run independently of, and concurrently with, the main program and every other thread. Each thread has its own stack and program counter and behaves like a separate program. For example, when you are playing a game, one thread watches for keyboard input, another might be maintaining the connection to the server, and another might be updating the frames. All these tasks are independent and need not run at the expense of one another: they run concurrently, or even in parallel.

So how exactly did I go about implementing threads in Sunflower, a programming language of my own design?
Firstly, I had no idea what paradigm modern programming languages adopt to implement threads, so I simply created an OS-level thread for every thread created in the frontend (Sunflower).
This is not too bad, until I hit my first crash: a segmentation fault, because most computers cannot allocate 10,000 OS threads (and they shouldn't have to).
So I started learning how other languages, particularly Go and Python, manage this problem of allocating a million dollars to a bakery. How does Go solve it? Go uses an N:M mapping, meaning it maps N jobs onto M threads. The M comes from the hardware specifications of the machine the program runs on (how many threads it can support). A scheduling algorithm (round robin, for example) then ensures jobs are distributed evenly across all M workers.

How did I solve the problem? I too use M workers, which either work or sleep (so as not to waste CPU cycles). I allocate M threads that all sleep by default (do nothing). They are all coordinated by a condition_variable, which acts like a producer signal. Whenever a job arrives in the pool queue, the condition variable is signalled, and a free worker wakes up, picks the job off the queue, and executes it. This differs from Go's round-robin example in that allocation simply goes to whichever worker wakes up first, so work ends up distributed without any explicit scheduling.

A simple explanation of what Sunflower does now

An initializer method that spins up the worker threads:

static std::vector<std::thread> v_workers;
SF_API void
init_runtime_threads ()
{
  size_t n = std::thread::hardware_concurrency ();
 
  if (!n)
    n = 4;
 
  for (size_t i = 0; i < n; i++)
    v_workers.emplace_back (worker_loop);
}

v_workers stores all the worker threads (the M) that watch for new jobs, pick them up, and execute them. Here is the implementation of worker_loop:

static void
worker_loop ()
{
  while (1)
    {
      ThreadHandle *th = nullptr;
 
      {
        std::unique_lock<std::mutex> lock (wl_mutex);
        t_cv.wait (lock, [] { return shutting_down || !q_jobs.empty (); });
 
        if (shutting_down && q_jobs.empty ())
          return;
 
        th = q_jobs.front ();
        q_jobs.pop ();
      }
 
      if (th == nullptr)
        return;
 
      /* call sunflower function */
 
      th->get_done () = true;
 
      /**
       * closed before thread finished
       * user likely called .close () before .join ()
       * detached thread
       */
      if (th->get_is_closed ())
        {
          size_t id = th->get_id ();
 
          {
            std::lock_guard<std::mutex> lock (cre_mutex);
            delete v_handles[id];
            v_handles[id] = nullptr;
            idx_avl.push_back (id);
          }
        }
 
      /* local cleanup */
    }
}

A simple explanation of worker_loop. The lines

std::unique_lock<std::mutex> lock (wl_mutex);
t_cv.wait (lock, [] { return shutting_down || !q_jobs.empty (); });

mean that the worker blocks until the predicate [] { return shutting_down || !q_jobs.empty (); } evaluates to true. std::unique_lock is a more flexible lock than std::lock_guard in the sense that it allows the mutex to be unlocked (and relocked) at will, which is exactly what wait needs: it releases the mutex while the worker sleeps and re-acquires it before returning. This is far more efficient than an infinite loop that spins until the queue is non-empty:

while (!q_jobs.empty ()); /* blocks, wastes CPU cycles */

When a job is received, the worker evaluates it and immediately releases the reference to the function's return value. This is by design: storing such references leads to memory leaks, and the return value becomes difficult to manage. I am currently looking at ways to overcome this, but conventionally most threaded programs are independent in the sense that the main program rarely needs the return values of thread functions. (Even in POSIX C, a thread function has the signature void *(*)(void *), and its return value is only retrievable by explicitly calling pthread_join.)

Now, moving on to the real-world implementation, with the wisdom we gathered along the way.

State variables:

static Vec<ThreadHandle *> v_handles; /* handles to worker jobs */
static Vec<size_t> idx_avl;           /* indices available */
 
static std::mutex cre_mutex; /* mutex for create () and close () */
static std::mutex wl_mutex;  /* mutex for worker_loop */
static std::mutex thr_mutex; /* threadhandle->run mutex */
 
static std::condition_variable t_cv;
static std::queue<ThreadHandle *> q_jobs;
static std::vector<std::thread> v_workers;
static bool shutting_down = false;

ThreadHandle implementation (summarized)

class ThreadHandle
{
private:
  Object *name = nullptr;
  Object *args = nullptr;
  Module *mod = nullptr;
  bool done = false;
  bool is_closed = false;
  size_t id = 0;
 
public:
  ThreadHandle (Object *_Name, Object *_Args, Module *_Mod)
      : name{ _Name }, args{ _Args }, mod{ _Mod }
  {
    IR (name);
    IR (args);
  }
 
  Module *&get_mod ();
  Object *&get_name ();
  Object *&get_args ();
  inline bool &get_done ();
  inline size_t &get_id ();
  inline bool &get_is_closed ();
  void run ();
 
  ~ThreadHandle ()
  {
    DR (name);
    DR (args);
  }
};

Pushing jobs to queue

void
ThreadHandle::run ()
{
  {
    /* must be the same mutex the workers wait on (wl_mutex),
       otherwise this push would race with the pop in worker_loop */
    std::lock_guard<std::mutex> lock (wl_mutex);
    q_jobs.push (this);
  }
  t_cv.notify_one ();
}

Implementing the Sunflower API. We need to implement the following functions:

  • create (...) Create a thread
  • run () Run the thread in a detached state
  • join () Block until the thread is complete
  • join_all () Block until all threads are complete
  • close () Close the thread (detach it if not finished)

Sunflower by default does not provide the power to kill a running thread, since that can lead to undefined behavior.

We will implement them in order: create

SF_API Object *
create (Module *mod)
{
  /**
   * Since we are writing to v_handles,
   * we need a mutex lock
   */
  std::lock_guard<std::mutex> lock (cre_mutex);
 
  Object *o_fname = mod->get_variable ("fname");
  Object *o_fargs = mod->get_variable ("fargs");
 
  assert (o_fname->get_type () == ObjectType::FuncObject);
  assert (o_fargs->get_type () == ObjectType::ArrayObj);
 
  ThreadHandle *th = new ThreadHandle (
      o_fname, o_fargs,
      static_cast<FunctionObject *> (o_fname)->get_v ()->get_parent ());
 
  /**
   * check if any index is available
   * Index availability means the worker
   * job at that index has completed.
   * We can reuse that vector position
   */
  size_t idx;
  if (idx_avl.get_size ())
    {
      size_t p = idx_avl.pop_back ();
      v_handles[p] = th;
      idx = p;
    }
  else
    {
      idx = v_handles.get_size ();
      v_handles.push_back (th);
    }
 
  th->get_id () = idx;
 
  Object *ret = static_cast<Object *> (new ConstantObject (
      static_cast<Constant *> (new IntegerConstant (static_cast<int> (idx)))));
 
  IR (ret);
  return ret;
}

run

SF_API Object *
run (Module *mod)
{
  Object *o_id = mod->get_variable ("id");
  assert (OBJ_IS_INT (o_id));
 
  size_t id = static_cast<size_t> (
      static_cast<IntegerConstant *> (
          static_cast<ConstantObject *> (o_id)->get_c ().get ())
          ->get_value ());
 
  //   std::cout << id << '\t' << v_handles.get_size () << '\n';
  assert (id < v_handles.get_size ());
  v_handles[id]->run ();
 
  Object *ret = static_cast<Object *> (
      new ConstantObject (static_cast<Constant *> (new NoneConstant ())));
 
  IR (ret);
  return ret;
}

join

SF_API Object *
join (Module *mod)
{
  Object *o_id = mod->get_variable ("id");
  assert (OBJ_IS_INT (o_id));
 
  size_t id = static_cast<size_t> (
      static_cast<IntegerConstant *> (
          static_cast<ConstantObject *> (o_id)->get_c ().get ())
          ->get_value ());
 
  assert (id < v_handles.get_size ());
 
  ThreadHandle *th = v_handles[id];

  /* a closed slot is nullptr; treat it as already joined.
     the spin is a reasonable workaround since most threads
     run in a detached state */
  if (th != nullptr)
    while (!th->get_done ())
      ;
 
  Object *ret = static_cast<Object *> (
      new ConstantObject (static_cast<Constant *> (new NoneConstant ())));
 
  IR (ret);
  return ret;
}

join_all

SF_API Object *
join_all (Module *mod)
{
  for (ThreadHandle *&th : v_handles)
    {
      if (th == nullptr)
        continue;
 
      /* could use a condition_variable here */
      while (!th->get_done ())
        ;
    }
 
  Object *ret = static_cast<Object *> (
      new ConstantObject (static_cast<Constant *> (new NoneConstant ())));
 
  IR (ret);
  return ret;
}

close

SF_API Object *
close (Module *mod)
{
  std::lock_guard<std::mutex> lock (cre_mutex);
  Object *o_id = mod->get_variable ("id");
  assert (OBJ_IS_INT (o_id));
 
  size_t id = static_cast<size_t> (
      static_cast<IntegerConstant *> (
          static_cast<ConstantObject *> (o_id)->get_c ().get ())
          ->get_value ());
 
  if (id < v_handles.get_size ()
      && v_handles[id] != nullptr) /* bounds check before indexing */
    {
      ThreadHandle *&th = v_handles[id];

      if (th->get_done () && !th->get_is_closed ())
        {
          /* release only when thread execution is done */
          delete th;
          th = nullptr;
          idx_avl.push_back (id);
        }
      else if (!th->get_done ())
        {
          th->get_is_closed () = true; /* detach */
        }
    }
 
  Object *ret = static_cast<Object *> (
      new ConstantObject (static_cast<Constant *> (new NoneConstant ())));
 
  IR (ret);
  return ret;
}

All of this constitutes the _Native_Thread module of Sunflower, which the thread module uses to provide an object-oriented wrapper. Here is a minimal implementation of such a wrapper:

import '_Native_Thread' as client
 
class Thread
    id = -1
 
    fun _init (self, fn, args)
        self.id = client.create (fn, args)
    
    fun run (self)
        if self.id == -1
            return? "Thread has not been initialized or is closed"
 
        client.run (self.id)
    
    fun join (self)
        if self.id == -1
            return? "Thread has not been initialized or is closed"
 
        return client.join (self.id)
    
    fun _kill (self)
        # write ("called thread::_kill")
        if self.id != -1
            client.close (self.id)
 
fun _init (fn, args)
    return Thread (fn, args)
 
fun join_all ()
    client.join_all ()

This concludes Sunflower's thread implementation. There is certainly room for improvement, and this is only the first stable version of threading in Sunflower, made possible with modern C++ and careful algorithm design. I will document my journey of language design in my own mini-series here on my personal blog. Do let me know your thoughts, suggestions, or resources I could learn from. Cheers!