Machine code, Assembly and High Level Languages explained

First written on 07/11/2023

Introduction

This article explains the differences between machine code, assembly and high level languages. My intention in writing it was to show where assembly language stands relative to the two worlds of software - the dark and mysterious world of machine code and the friendly and familiar world of high level languages. Even though this article touches on some quite complicated aspects of software engineering, I have attempted to make it comprehensible to a more general audience.

Before I talk about computer languages, I would like to explain a little about processors.

What is a processor?

A processor is a machine that can do many things. It can take two numbers and add them. It can store the output of this addition somewhere in memory and then retrieve it and multiply it by another number. It can then take this product and repeatedly subtract the number one from it until the result becomes zero.
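For readers who already know a little C, the sequence of operations described above can be sketched as follows. This is only an illustration of the idea. The function name `add_multiply_countdown` and its parameters are made up for this example.

```c
/* Adds two numbers, multiplies the stored sum by a third number, then
   counts the product down to zero. Returns the number of subtractions. */
unsigned int add_multiply_countdown(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int sum = a + b;        /* add two numbers and store the result */
    unsigned int product = sum * c;  /* retrieve it and multiply by another number */
    unsigned int steps = 0;

    while (product != 0) {           /* subtract one until the result is zero */
        product = product - 1;
        steps = steps + 1;
    }
    return steps;
}
```

Calling `add_multiply_countdown(2, 3, 4)` adds 2 and 3, multiplies the sum by 4 and then subtracts one twenty times, so it returns 20.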

These are just a few examples of what processors can do. They sound pretty boring until you wire the processor up with some external hardware and make it add two numbers only if it's raining on a Friday. Then make it send the result of this addition to you by email.

There is an important difference between processors and most other machines that we see in our daily lives.

For example, consider a coffee machine. It was designed to make coffee. What if you wanted your coffee machine to also make tea? Maybe you could redesign it, but that would alter its original form. The point is, a coffee machine is a fixed-function device. It does only one thing.

Now consider a group of three friends who lie around on a couch and do nothing all day. A fourth friend then joins them and starts commanding them. He tells one to go to the market and buy some eggs, the other to find a pan and the third to make some fried eggs once the first two return. After the meal is complete, the fourth friend has some lunch while sending the group of three to go fishing.

This last example is a type of system that is capable of doing many different things. The three friends could even make themselves coffee if they wanted to, without using a coffee machine.

In order to come closer to our topic, let us consider yet another example. An AND gate is a type of electronic component that accepts two input signals and gives one output signal. It is going to output a 1 (represented by a high voltage signal) only if the two input signals are both 1s. In the case that either one or both of the input signals are 0 (represented by a low voltage signal), it will output a 0.

This is a fixed-function device. It can only do one thing. Without altering the design of this component, you cannot make it do something else.

A processor is like a group of three friends implemented in a silicon chip. It can act as an AND gate, an OR gate, a NOT gate or some combination of these gates. It can be utilised to play Minecraft or control a satellite orbiting Mars.
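To make the gate comparison a little more concrete, here is a minimal sketch in C of a processor "acting as" each gate. The function names are my own, and the inputs are assumed to be either 0 or 1. On most processors, operators such as `&` and `|` are carried out by matching instructions, although the exact machine code depends on the compiler.

```c
/* Each function takes inputs that are 0 or 1 and returns 0 or 1. */
int and_gate(int a, int b) { return a & b; }  /* 1 only if both inputs are 1 */
int or_gate(int a, int b)  { return a | b; }  /* 1 if at least one input is 1 */
int not_gate(int a)        { return a ^ 1; }  /* flips 0 to 1 and 1 to 0 */

/* A combination of gates: NAND is an AND followed by a NOT. */
int nand_gate(int a, int b) { return not_gate(and_gate(a, b)); }
```

Unlike the physical AND gate, the same processor runs all four functions. Which "gate" it behaves as depends only on the instructions it is given.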

An important point from the example of three friends is that nothing would have happened unless someone came in and told the three what to do. Processors work in the same way. They need to be told what to do.

If you were to command a group of people, you would probably use some human language, like English. Processors don't understand English, however. They have their own language. This language consists of numbers represented as electric signals that the processor can understand. If you wanted someone to bring you a pan, you would tell them "Hey, could you please bring me a pan?". If you wanted a processor to add two numbers together, you would have to give it a specific sequence of high and low electric signals (1s and 0s). This sequence would tell the processor what operation to perform. For this reason, such sequences are often called "operation codes". They are also sometimes called "machine code" or "instructions". The act of telling a processor or a computer what to do is known as "programming".

Machine code is hard for humans to understand. Assembly tries to fix this

Machine code is designed to be easy for processors to understand. It wasn't made for humans to deal with. To program a processor, a programmer (typically a human being) must give it instructions, but the language in which these instructions have to be written is incredibly hard to comprehend.

To overcome this problem, programmers created a language called "assembly". Assembly is essentially a human-friendly way to represent machine code. While operation codes consist of numbers that mean nothing to a human, assembly consists of text commands which are typically derived from languages such as English. Assembly may look nothing like English at first glance, but it ultimately takes inspiration from human language and was designed for humans to read and write.

OK, we have a language that humans can understand, but can the processor understand it? No, it cannot. Processors can only understand machine code. Assembly is meaningless gibberish to them. In this case, we need someone to do a translation for us. This "someone" is typically a piece of software called an "assembler". An assembler takes assembly code (human understandable) as an input and outputs machine code (processor understandable). The act of doing so is called "assembling".

Software is usually just machine code mixed up with data stored in a file on your computer. An assembler is itself software, which makes it machine code that produces machine code. Not only can processors be programmed to do a lot of interesting things, but they can also help us to program them.

Let's look at an example of what assembly looks like. Here is a command for a processor written in plain English:

"Add 5 and 6, then store the result in a register (a type of memory)".

In assembly, this may look something like this:


MOV ax, 5 

ADD ax, 6

The two lines above are in fact instructions for a 16-bit Intel processor. When assembled, these instructions will turn into a sequence of bytes that the processor can later run (a byte is a group of eight 1s and 0s, the basic unit computers use to store numbers). There are many ways to represent these bytes. Here is a representation in hexadecimal (a type of number system):


b8 05 00 83 c0 06

Here is a binary representation (using 1s and 0s):


10111000 00000101 00000000 10000011 11000000 00000110
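The translation from the two assembly lines to the bytes shown above can be sketched as a toy assembler written in C. This is only a sketch under heavy assumptions. The function name `assemble_line` is made up, it understands nothing except `MOV ax, <number>` and `ADD ax, <small number>`, and a real assembler is far more involved and may choose a different encoding for the same instruction.

```c
#include <stdio.h>
#include <string.h>

/* Toy assembler: translates one line of assembly text into machine code
   bytes for a 16-bit Intel processor. Only two instruction forms are known.
   Returns the number of bytes written to out (always 3), or 0 on error. */
size_t assemble_line(const char *line, unsigned char *out)
{
    char mnemonic[8];
    unsigned int value;

    /* Split the line into the command name and its number. */
    if (sscanf(line, "%7s ax, %u", mnemonic, &value) != 2 || value > 0xFFFF)
        return 0;

    if (strcmp(mnemonic, "MOV") == 0) {
        out[0] = 0xB8;                 /* operation code: put a number into ax */
        out[1] = value & 0xFF;         /* low byte of the number */
        out[2] = (value >> 8) & 0xFF;  /* high byte of the number */
        return 3;
    }
    if (strcmp(mnemonic, "ADD") == 0 && value <= 0x7F) {
        out[0] = 0x83;                 /* operation code: arithmetic with a small number */
        out[1] = 0xC0;                 /* selects "add" and the ax register */
        out[2] = value & 0xFF;         /* the small number itself */
        return 3;
    }
    return 0;                          /* anything else is not understood */
}
```

Given the text `MOV ax, 5`, the function writes the bytes b8 05 00. Given `ADD ax, 6`, it writes 83 c0 06. Together these are the six bytes shown above.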

Information can generally be represented in many different ways. Internally, inside a computer, all information is represented in binary, as sequences of high and low electrical signals.
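The idea that one byte can be written down in several ways can be tried out with a short C sketch. The two function names below are my own. Each one takes the same byte and produces a different textual representation of it.

```c
#include <stdio.h>

/* Writes the 8 binary digits of a byte into text (room for 9 characters). */
void byte_to_binary(unsigned char byte, char text[9])
{
    for (int bit = 0; bit < 8; bit++) {
        /* Test the bits from the leftmost (most significant) to the rightmost. */
        text[bit] = (byte & (0x80 >> bit)) ? '1' : '0';
    }
    text[8] = '\0';
}

/* Writes the 2 hexadecimal digits of a byte into text (room for 3 characters). */
void byte_to_hex(unsigned char byte, char text[3])
{
    snprintf(text, 3, "%02x", (unsigned int)byte);
}
```

For the first byte of the example, 0xb8, `byte_to_hex` produces `b8` and `byte_to_binary` produces `10111000`. The byte itself is the same in both cases. Only the way it is written down changes.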

A lot of the time, people say that "machine code is 1s and 0s and assembly is text". This is true from a certain perspective, but things become confusing once you realise that even files containing assembly "text" are internally represented as 1s and 0s. The sequences of 1s and 0s in an assembly file and in an equivalent machine code file are not the same. They are stored in a similar manner, but what matters is the way each file is to be used. Assembly files are typically opened by a programmer, using a text editor to make changes to the code or simply read it. An assembly file can also be passed to an assembler to turn it into a file containing machine code. Machine code files can be "run" or "executed" by feeding the instructions contained within them into a processor.
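The difference between the two kinds of file can be seen by putting one line of assembly text next to its machine code. The sketch below assumes the text is stored in ASCII, which is the usual case. The names `assembly_text`, `machine_code` and `print_bytes` are made up for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* One line of assembly, exactly as a text editor would save it. */
const char assembly_text[] = "MOV ax, 5";

/* The machine code an assembler produces from that line. */
const unsigned char machine_code[] = { 0xb8, 0x05, 0x00 };

/* Number of bytes in each form; the text's terminating '\0' is not counted. */
const size_t assembly_size = sizeof assembly_text - 1;
const size_t machine_code_size = sizeof machine_code;

/* Prints every byte of a sequence as two hexadecimal digits. */
void print_bytes(const unsigned char *bytes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        printf("%02x ", (unsigned int)bytes[i]);
    }
    printf("\n");
}
```

Passing the text to `print_bytes` shows nine bytes, one for each character, starting with 4d for the letter M. Passing the machine code shows only three bytes. Both are sequences of 1s and 0s, but they are different sequences with different uses.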

Beyond assembly: High level languages

In the world that we live in today, where programmers may write hundreds of lines of code on a daily basis and software sizes scale up to billions of bytes, assembly becomes impractical for general use.

Notice how we went up in levels of abstraction when going from machine code to assembly. From numbers to something that resembled English. The next step is to go even higher: create a language that is even easier for a human to understand. Languages on this next level are called "high level languages".

Examples of high level languages include C, C++, Java and Python. The function of a high level language is similar to that of an assembly language - to take English-like instructions written by a human and turn them into instructions for a processor.

High level languages can be classified by the method they use to produce machine code. A language such as C is called a "compiled language". The C code has to be fed to a piece of software called a "compiler", which first turns it into assembly code and then feeds this assembly code to an assembler to turn it into machine code. Usually, the part involving assembly is done in the background during compilation. The programmer only sees C code turn straight into machine code.

Here is an example of C code which does something similar to what our assembly code did:


int x, y; 

x = 5; 

y = x + 6;

This particular example doesn't really show the English-like vocabulary of C, but it nevertheless is easier for a human to understand due to its similarity with mathematical notation. Compare this to the assembly shown previously.

The examples given here are very simple. They do not highlight all the differences between assembly and high level languages. Generally, a programmer would have to write much more assembly code than C code in order to implement the same thing.

One advantage of assembly over high level languages is that you are usually more in control of the resulting machine code. Compilers typically do a lot of optimization. They may take a part of your code and replace it with some other code that does the same thing but much faster, so the machine code you get does not always match what you wrote line for line. An assembler may do some optimization, but usually the optimizations that it does aren't as sophisticated as those done by a C compiler.
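As a rough illustration of what "code that does the same thing but faster" can mean, consider the two C functions below. Both function names are my own. The first adds up the numbers from 1 to n with a loop and the second uses a formula. Some compilers can rewrite a loop like the first into something closer to the second when optimizations are turned on, but whether that actually happens depends on the compiler and its settings.

```c
/* Adds up 1 + 2 + ... + n the obvious way, one number at a time. */
unsigned int sum_with_loop(unsigned int n)
{
    unsigned int total = 0;
    for (unsigned int i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}

/* Gives the same answer for small n with a single calculation and no loop. */
unsigned int sum_with_formula(unsigned int n)
{
    return n * (n + 1) / 2;
}
```

For n = 100 both functions return 5050, but the first performs a hundred additions while the second performs one multiplication, one addition and one division. In assembly, you would get whichever version you wrote yourself.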

Conclusion

In this article, I have talked about the differences between the types of languages that can be used to program processors - machine code, assembly languages and high level languages. I have also given an overview of processors and the ways programmers can instruct computers what to do.

Have something to say? You can find my contacts on the contact page.