In-Place Merge Sort Demystified

There are various implementations of in-place merge sort around the web, including my own, and various papers describing procedures for in-place merging. Then a few days ago, while researching Smoothsort, I stumbled upon a page titled Smoothsort Demystified which gives a nicely detailed explanation of the sorting algorithm. I haven’t seen such a tutorial for in-place merge sort. At best, I’ve only seen vague explanations or descriptions which use terminology that many people are unlikely to understand. Even Wikipedia only gives a couple details of in-place merging without any adequate explanation or demonstration. I figured, I know a technique for in-place merge sort and how to implement it, so I’ll take a stab at writing my own tutorial.

Update 8/31/14: It seems that my initial understanding of “rotations” was naive at best. I plan to revise this post to describe three in-place merging techniques: insertion -> range swaps -> rotations. Also, thanks to BonzaiThePenguin, I’m now aware of an in-place stable sorting algorithm called Block sort (and another link) which runs in O(n log n) time. The algorithm seems highly complex and I haven’t had time to study it yet.

Before reading any further, it is highly recommended that you already understand merge sort and preferably have even implemented it yourself. If not, there are more than enough resources online that will adequately teach you the merge sort algorithm, including several YouTube videos. It would be best to familiarize yourself with the algorithm before attempting to follow my tutorial.

What do you mean by “in-place”? A standard merge sort requires you to allocate a buffer equal in length to the input being sorted. For example, if you wish to sort an array with 1000 elements, it is necessary to allocate an additional scratch array with 1000 elements as a space for merging elements before copying them back to the input array. This is why merge sort has O(n) space complexity. An in-place sorting algorithm doesn’t require allocating any additional memory for sorting. Several algorithms meet this requirement, including insertion sort and heap sort which have O(1) space complexity. However, in-place merge sort has O(log n) space complexity. This is because, like quick sort, it is a recursive function which requires pushing elements onto the stack.

How does in-place merging affect the performance and complexity of merge sort? A naive implementation may be 2-3x slower than a standard merge sort. In-place merging only requires a minor increase in comparisons. The worst effect on performance occurs due to the significant increase in reads and writes (there is lots of swapping going on). I will explain a couple of optimizations towards the end which can significantly increase the performance of the algorithm.

Is in-place merge sort stable? Yes! As long as it is implemented correctly, in-place merge sort will preserve the stability of merge sort.

Why would I want to use an in-place merge sort? The truth is, you probably wouldn’t. The only reason I can fathom is if you’re working with embedded systems with very limited memory. In the vast majority of cases, merge sort or Timsort would be a better choice. Otherwise, this is merely for those who are interested or graduate students who need a topic for their thesis.

Is this applicable to Timsort? Only partially. It is perfectly possible to implement an in-place natural merge sort. However, Timsort is an adaptive sort and has a galloping mode which can significantly reduce the number of comparisons on partially sorted data. If you were to implement an in-place Timsort, it would be impossible to implement galloping mode as well… at least using the technique I describe in this post.


Before I move forward, I should define what is meant by rotations as they are used heavily in the merging process. Actually, rotations are used in a limited form here in a way that is functionally equivalent to std::swap_ranges in C++ (aka block swap). First, I’ll explain how rotations are used in the algorithm, then I’ll explain what rotations really are and why it’s still a proper usage of the term.

For in-place merging, rotations are used to swap the elements in two adjacent sublists which are equal in length. As you will soon see, this technique is used to move a large number of elements at once.


That is the basic operation of in-place merging. So what are rotations anyways? To rotate elements in a sublist means to shift the elements some number of places to right and elements at the end wrap around to the beginning (and vice versa). So if you have the list [1,2,3,4,5], shifting it one place to the right will give you [5,1,2,3,4].

The fact that the two ranges being swapped are adjacent is important in calling this operation a rotation. If the length of the sublist being rotated is an even number, then shifting the elements half that distance results in the same operation as swap_ranges above. In the following graphic, I shift the elements one place to the right three times, achieving the same result as before.


That’s one way to look at rotations. The way that std::rotate in C++ is portrayed is different. Rather than shifting elements N places to the left or right, std::rotate rotates the elements on a pivot indicated by some middle element, such that the middle element becomes the new first element. Following that definition, if the middle element is M, then shifting the elements M places to the left achieves the same result.

In-Place Merging

Now that I’ve explained rotations in too much detail, I can explain how to use them for in-place merging. Let us define two sorted sublists, A and B, that we want to merge in-place.


The trick is to swap the largest elements in A for the smallest elements in B. A linear search can find the range of elements to rotate. Start from the middle and extend outwards, until you find an element in A which is less than an element in B. Rotate the elements between these bounds.


That’s the basic operation. However, something special happened. Of the two ranges, the smallest elements are in A and the largest elements are in B, albeit not in order. This means you can continue to merge A and B independently, so all you’re left with is two smaller merges. This also means that in-place merging is a recursive function in itself.

By recursively applying rotations in this manner, eventually you’ll merge the two ranges. Let’s apply this process once more and see what happens:


Just like that, the right-hand side is already sorted and the left-hand side is almost sorted. A couple more rotations on the left-hand side and the merge procedure would be complete.

In essence, each rotation moves the elements that much closer to their final location. When the sublists are large, rotations can move thousands of elements at a time. But as the sublists become smaller and smaller, rotations become less efficient. When the sublists become small enough, there are a couple of techniques you could apply that would be more efficient than applying several more rotations.

Binary Search

In the previous section, I stated that a linear search can be used to find the range of elements to rotate. While that is fine for small sublists, it would be very impractical for large sublists with thousands or millions of elements. Using a linear search, it may be necessary to compare thousands of elements to find the bounds of the range to rotate.

In general, a better approach is to use a binary search which is guaranteed to find the bounds within O(log n) comparisons. A binary search can be tricky to implement as it’s easy to get stuck in an infinite loop. However, the end result is worth it as it will find you the bounds much quicker and with far fewer comparisons.

Start half-way from the center and move the search inwards or outwards, depending on the result of the comparison. It’s important to only compare elements which are equidistant from the center!


Optimizing Using Insertions

When the sublists become small enough, using insertions to move the last few elements into place would be more efficient than using rotations. However, how do you decide whether to use insertions or rotations? What I do is define a threshold for when to switch to insertion.

A well optimized merge sort typically uses insertion sort to sort small sublists before merging anything. Similarly, a well optimized quick sort uses insertion sort once a partition is small enough. By referring to these as examples, I typically use insertion sort on sublists up to 32 elements in length. In a worst-case, this would imply (32+31+30+29+28+…) insertions. This can be calculated directly using the formula (n*(n+1)/2). For 32 elements, this gives us a worst-case of 528 insertions.

Now we need a formula to determine whether the number of insertions required to merge two lists is below this threshold. This is actually simple: just multiply the length of both lists. If one sublist has 10 elements and the other has 15 elements, this gives us a worst-case of (10*15=150) insertions.

Using a scratch array for merging

By applying rotations, you effectively reduce two large sublists into several smaller sublists. When the sublists become small enough, a very effective optimization is to simply allocate a small buffer to merge the sublists in. This will allow you to greatly reduce the number of rotations and is far more efficient than using insertions.

This can be further optimized as well. While most implementations of merge sort have O(n) space complexity, it is trivial to implement it such that it has O(n/2) space complexity and is generally a bit quicker as well. This is the same technique that Timsort uses to achieve O(n/2) worst-case space complexity.

Begin by copying one of the sublists into the buffer. Then simply merge the elements in-place. Depending on which sublist is smaller, it may be necessary to merge the elements in reverse order. In the following graphic, the green cells represent a small buffer which is large enough to store a copy of the right sublist.


This is a great technique to greatly reduce the space complexity while only being marginally slower, albeit technically it wouldn’t be an in-place merge sort anymore. This technique is also applicable to Timsort and allows you to retain some of the benefit of galloping mode. I have implemented my own variant of Timsort with O(n/1024) space complexity.


In this post, I have demonstrated a practical technique to merge two sublists in place. It has limited applicability in the real world, but is interesting none the less and is a topic of research that has been worthy of publication. When writing this tutorial, I did my best to explain things in simpler terms and to provide some intuition for why this technique works. I simply hope that it was easy to understand and that you learned something new from reading it.

I’m looking for feedback on this article. Please feel free to post any questions or comments here or on Reddit.


Sanha pt. 2 – Containers and Composition

I’ve decided I wanted to share a little more of the ideas of my imaginary programming language. I say imaginary because it is merely a design; there is no compiler or interpreter for this language. In my previous post, I gave an introduction to the syntax and type system.

Disclaimer: The examples I provide are merely meant to demonstrate the features, not to show practical usage. As such, some of the examples may appear crude and poorly written.

Review of Types

In my former post, I covered a bit about the type system in Sanha. In essence, a hybrid of dynamic typing and static typing is employed. When a variable is declared, it may be initialized with a value of any type. But once a variable is initialized, it’s type cannot change for the remainder of it’s lifetime.

def foo(#cond bool)
    if(cond) return "Ten"
    else return 10
#value = foo(true)
value = 30 // Error - Cannot assign int32 to string

It’s possible to explicitly define types for variables or function arguments, though there is no explicit type syntax. Rather, objects of various forms may be interpreted as types in certain contexts. For example, [string] is an array containing a typeid, but it may be interpreted as an array of strings.

#arr [string] = ["Ten", "Twenty", "Thirty"]

I don’t know if there is a term for this paradigm; I simply refer to it as types as expressions. Any expression that results in an object of a special form may be interpreted as a type.

def foo(){ return [int32] }
#arr foo() = [10, 20, 30]


For this post, I will introduce the built-in container types and the capabilities they provide. I define a container as a collection of elements. There are three basic container types: Arrays, Tuples, and Objects.

Arrays in this language are much like other languages: They contain an ordered sequence of elements of a single type of varying length. Square brackets [] are used to write array literals.

#arr [int32] = [10, 20, 30, 40, 50]
arr.length == 5

Tuples contain an ordered list of multiple types of a fixed length. Each index has a fixed type which cannot change once the tuple is initialized. They support random-access just like arrays. Since the elements can differ in type, tuples are one of the ways that you can employ dynamic typing in Sanha.

#tuple (string, int32) = ("Thirty Five", 35)
tuple.length == 2
tuple[0] == "Thirty Five"
tuple[1] == 35
tuple[0] = "Forty" // OK
tuple[0] = 40 // Error - Cannot assign int32 to string
tuple[1] = 40 // OK

Unlike tuples in D, which automatically expands the elements anywhere it’s referred, Sanha does not do this. This means it’s possible to have nested tuples.

#tuple = ( (10, 20, 30) , (40, 50, 60) )

Since tuples are of a fixed length, it’s also possible to specify names for each element and access them as members of the tuple. In this way, tuples are resemblant of C-style structs.

#tuple (#s string, #i int32) = (#s = "Forty", #i = 40)
tuple[0] is tuple.s
tuple[1] is tuple.i

Objects are an unordered set of members of varying types. As the name suggests, they are the basis of OOP, though there will be explicitly typed classes as well. Curly braces {} are used to write object literals. Objects may also contain methods (i.e. sub-functions).

#obj = {
    #s = "Fifty"
    #i = 50
    def timesTen()
        return i * 10
obj.i == 50
obj.timesTen() == 500

In my previous post, there was a blurb in one of the examples of sets being part of the language. I decided against that to enable certain semantics in the language which I’ll demonstrate later in this post. However, I figured this, and other types of containers like hash tables, could be implemented as part of the standard library. I expect that I may include literals for a few types of containers which are simply implemented in the standard library.

Member Grouping

So far, everything that you have seen so far is pretty conventional. However, if I ended it there, that would be boring. Before I get into composition, I wanted to introduce a smaller feature which I call member grouping.

Member grouping allows you to create a container using one or more members of an object. The following example demonstrates how to use the members of an object to construct an array, tuple, or another object:

#obj = {
    #a = 10
    #b = 20
    #c = 30
#array = obj.[a, b, c]
#tuple = obj.(a, b, c)
#obj2  = obj.{ #one = a, #two = b, #three = c }


This is my favorite feature of my own language. I refer to it as composition because it’s an extension of list comprehension but more powerful and flexible. In essence, composition encompasses the notion that you may write any sequence of statements and expressions which are evaluated in order and concatenated into a single container.

It’s important to know that many features which are ordinarily statements in other languages are expressions in Sanha. This includes control statements such as if/else or loops. The second important fact is that elements in a container may be separated by line-breaks as well as commas. So the following code is valid:

#arr = [
    if(cond) 600
arr.length // may be 3 or 4, depending on result of (cond)

Now this is where things get fun. Loops are iterated multiple times, so the result is that loops insert multiple elements into a container. This is the functional equivalent of list comprehension from other languages.

// arr contains the values 0 to 99
#arr = [loop(#i in 0..100) i]
arr.length == 100

If you put the two prior examples together, you can construct a container from any sequence of expressions and statements.

#arr = [
    loop(#i in 0..10) i
    loop(#i in 30..40) i
    if(cond) rand(0..100) // Made-up random generator function
    loop(#x in 0..10) loop(#y in 0..10) x*y

Arrays have the limitation that all of the elements must be of the same type. However, composition works in tuples as well. Just change the square brackets [] to parenthesis () and you have a tuple!

// Not everything needs to be on a separate line
#tuple = ("The powers of two", loop(#i in 0..8) i**2)

Objects are a bit different. Objects require you to specify a name for every element. However, you may still include non-declarative statements in objects and they’ll be executed in order.

#obj = {
    #i = 100, #j = 200
    if(cond) printLine(i) // Result of expression is ignored
    #k = 300

There are two keywords which assist in composition: unpack and mixin.

Unpack is a statement which will expand the elements of any iteratable object into an array or tuple. It’s no different than using a loop to perform the same operation. The keyword simply provides a convenient shortcut and it proves to be useful to overcome some inconveniences in the syntax.

#arr = [unpack 0..100]
#tuple = (unpack arr)

Mixin is a statement which will copy of the members of an object into another object (applicable to tuples with named elements as well, though the elements will be in no particular order). So all of the members of that object will be redeclared in the current scope.

#obj = { #a = 10, #b = 20, #c = 30 }
#obj2 = {
    mixin obj
    #d = 40, #e = 50
obj2.a == 10
obj2.e == 5

 Function Arguments

A little secret about function arguments, they’re nothing more than tuples. So all of the features I’ve demonstrated on tuples may be applied to function calls as well. For example, function arguments may be separated by line breaks:

    "Hello, World!"
    "The current date is: "
    "The current time is: "

Suppose one function returns a tuple and you want to pass those elements as arguments to another function. All you need to do is use the unpack keyword to expand the elements into the function call.

def one(){ return (10, 20, 30) }
def sum3(#a #b #c int32){ return a + b + c }
#value = sum3(unpack one())
value == 60

Hidden Semantics

Now that I have covered containers and composition, I can demonstrate how they affect the semantics of the syntax.

Many languages use curly braces for structured programming. Usually, there’s not much in terms of semantics here. However, considering that control statements can be used in expressions, how does it affect the result when curly braces are introduced? A few languages, including Rust, choose to make the last statement in a block to be the result of the expression. That is not what happens in Sanha though. Sanha always defines curly braces to mean objects. So if your control statement is followed by curly braces, the result is going to be an object.

// A crude example...
#obj = if(cond)
    #i = 30
    #j = 40
    #i = 50
    #j = 60
obj.(i, j) // may equal (30, 40) or (50, 60)

There is no rule forcing you to use curly braces either. You can just as well use parenthesis to nest your code, though this goes against convention and is highly discouraged.

    printLine("Hello, World!")

The semantics state that control statements are always followed by a single statement or expression. While containers may have multiple statements and expressions, the entirety of the container is a single expression.

Let’s make things a bit more interesting. You know these semantics apply to control statements. Now consider, I define functions to be control statements as well because, in a sense, they also control the flow of execution. So this means that the same semantics of control statements apply to functions as well.

Nothing is ever implicitly returned from functions. A function returns nothing if there is not any type of return statement (of which there are a few).

// Single statement function
def foo() return 35
// Nesting using arrays? Why not!
def bar()
    return "nutin"

Program Design

The semantics I’ve covered in this post may raise some questions. I’ll try to clarify a few things here.

There are a few “gotchas” to look out for, such as:

// Expression continues after closing curly brace
if(true) { printLine() } printLine() // Error
// Correct:
if(true) { printLine() }
// Alternatively, you may use a semi-colon or comma:
if(true) { printLine() } ; printLine()

Suppose you have an if/else statement nested using curly braces and you want the result of that statement to be a single value and not an object. Simply store the desired value in a variable and extract it afterwards:

#value = if(cond) {
    #value1 = rand(), #value2 = rand()
    #x = if(value1 <= value2) value1 else value2
}.x // Extract value of x here

What if you want to extract two or more values into an array? Use member grouping and the unpack keyword.

// A crude example...
#arr = [
    unpack if(cond) {
        #a = 10, #b = 20, #c = 30
    }.(a, b, c)
    else {
        #a = 40, #b = 50, #c = 60
    }.(a, b, c)

The semantics allow you to open up your functions in various ways:

def foo(#arg)
if(typeof(arg) == string)

Closing Note

Everything I’ve covered so far does beg the question: Wouldn’t this be a horribly inefficient language? I’ve put a lot of thought into the design and features of this language. There are certain aspects of the language that may be inefficient, especially those which utilize dynamic typing or JIT’ing. The most important thing to know is that this is meant to be a compiled language and there are several opportunities for optimization and elimination of dead code.

Semantically, using curly braces for structured programming requires creating objects everywhere. However, a compiler would be smart enough to know when an object isn’t captured or stored in a variable, so it can eliminate the object altogether and allocate variables on the stack rather than the heap.

While a variable can be initialized with any type, it’s type cannot change after it is initialized. In the majority of cases, a type for that variable can be deduced at compile-time so there is zero overhead at runtime.

Ultimately, this language is nothing more than a design and I have no plans to actually implement a compiler or interpreter for this language. This is simply something I enjoy pondering about and have put a great amount of thought into. It gives me something to write about and perhaps somebody out there will find my ideas interesting enough to do something with them.