
Thu, 01 Apr 2010

Now and again I get asked a question by a student or an aspiring programmer. It's of the form, "I've been programming in <comfy language> and now I want to learn some C. What should I concentrate on?" So, here are some areas I think you need to look at when moving from a higher-level language to C, some of which will only be appropriate when coming from certain languages:

  • Pointers - some higher-level languages manage pointer dereferencing themselves. Java, for example, is syntactically rather quiet about pointers. In C, you have to declare things as pointers, remember that they're pointers and dereference them when you want access to the things they point to. Oh, and if you dereference a pointer that is accidentally not pointing at what you expected, then if you're lucky you'll get a segmentation violation, and if you're unlucky you'll get some weird unpredictable behaviour in your code. If you're not used to this, I recommend finding some specific exercises to improve your pointer-handling.
  • Memory management - a typical high-level language will provide a "new" operator, allocate some appropriate space in memory, and once the memory isn't being used it will eventually reclaim it with a garbage-collector. In C, none of this is the case. To allocate memory, you call the malloc library function. To signal that the memory is available for further allocation, you use the free library function. If you malloc memory and there's no way for that memory to become free, it will just stay allocated throughout the lifetime of the program - the so-called memory leak. There are tools that help, of course - valgrind tracks memory allocations and there are garbage collectors that you can add to your programs - but the fundamental issue is that any time you allocate memory, you need to assign responsibility for freeing it up again.
  • Inheritance/variant records - Most modern programming languages provide at least something like variant records, in which different record formats are combined into a single type. In more object-oriented languages, a class hierarchy can be used to a similar effect, but with variant behaviour thrown in as well. C has neither of these things. Its union facility allows a type to contain different record formats, but it is up to the programmer to keep track of which member's format is being used at any given moment. Inherited behaviour is possible with function pointers, often wrapped up in a set of macros.
  • Interfaces - Interfaces in high-level languages typically identify a set of method signatures that a class must implement. This mechanism allows the specification of an interface without also specifying an implementation, useful in a variety of inheritance idioms. C has no specific construct that replicates this, having no built-in mechanism for mapping a name to a different behaviour depending on the type of an object.
  • Arrays - Arrays are conceptually a fixed-length list of items of a certain type. Arrays are typically safe and easy to use in high-level languages, and there are often libraries of specialised containers for more advanced purposes. In C, arrays are given a fixed size, but there are no checks made on accesses to the array. Furthermore, there is a kind of interchangeability of pointers and arrays that is idiomatic in C but lends itself to abuse. The problems can be surmounted with macros, defensive programming or external checking (e.g. with valgrind).
  • Iteration - Many high-level languages have powerful iterators that step over a data structure to perform some action element by element. For arrays in C, you have two choices for your loop (apart from the style choice of choosing while or for): you can choose to advance a pointer across the array, or you can choose to advance an index over the array.
  • Type-safety - In many languages, there is a level of type-safety. This ensures that an identifier in a program is guaranteed to relate to a meaningful value of a given type. C is not strong on applying domain-level meanings to data; it operates very much in terms of numbers and pointers. It keeps no run-time type information and allows arbitrary type conversions, particularly from pointers to untyped pointers (void *) and back again.
  • Strings - In C, a string is just a pointer to a piece of memory that contains characters. The string begins at the memory location pointed to and continues through successive bytes, ending when a zero byte is reached. This has some particular consequences - firstly, comparison of pointers to characters will just show whether they point to the same string, not necessarily whether they point to strings that contain the same characters. Secondly, any operation on a string needs to preserve the end-of-string zero marker to ensure that the result can be interpreted as a string. Finally, it is not possible for a string to contain a zero (NUL) byte. Some of these problems are mitigated through the use of standard library routines; it is also possible to create a structure that stores a string length alongside the string contents. Newcomers to C may also find it difficult to think about characters and strings as entirely different types.
  • Standard library - Modern programming languages come with rich and varied libraries to handle containers, databases, GUIs and a whole lot more. In contrast, the C standard library is pretty thin, and is focused on low-level resources like memory, files, strings and so on. However, to compensate for this slight omission, pretty much any utility library is available for use in a C program.
  • Exceptions - C does not have the structured exception mechanism of other languages. Instead, errors are sometimes signalled by a non-zero return code from a function, and sometimes through a mysterious globally-accessible "errno" variable. C programmers often forget to check for exceptional conditions, which leads to crash-prone code. Functions in C can only return a single value as the function result, which further encourages the omission or sidelining of error signalling.
  • Input/output - The C input/output system is heavily constrained. Most of the library routines work on fixed-length buffers. In other languages, there is a uniform model of streams or readers and writers. For C, there is one system for dealing with files, another system for dealing with sockets, and then a whole series of ioctl calls for managing low-level details. There is still little built-in support for Unicode characters.
  • Initialisation - In most useful languages, data is initialised to some kind of empty representation before it can be used. In C, local (automatic) variables and memory returned by malloc are not initialised for you: they contain whatever leftover data happens to be in the corresponding memory location from some previous activity. Only objects with static storage duration, such as file-scope variables, start out zeroed. Failing to initialise data in a C program is a common source of problems.
  • Polymorphism - The concept of polymorphism appears in many languages. This refers to different behaviours that are called depending on the type of any arguments, direct object or return value specified. Polymorphism is key to object-oriented programs. In C, there is no polymorphism - a particular name for a function represents a particular address of a piece of code to execute. Some forms of polymorphism can be achieved in C through the use of union types.
  • Lambdas - in some languages, it is possible to create an expression that denotes a behaviour as written inside that expression. In C, this kind of structure does not really exist. However, it is possible to write the behaviour out as a function and then create a pointer that points to that behaviour, and pass that pointer to another function to be called.
  • Packaging - For most practical purposes, C provides two levels of scope - file scope and block scope. The file scope runs from the point of declaration to the end of the file, and is used for "global" data. The block scope runs from the point of declaration to the end of the enclosing block. At any point in the program, a given name refers to a particular entity declared with that name, starting at the block in which the reference occurs and moving out to enclosing scopes until a scope is found in which the name is present. Since function definitions do not nest within other function definitions, there is no possibility of defining closures as found in languages like Scheme.

Maybe in some future posts I'll try to pull together some specific discussion of each of these topics in more depth.


Vivi wrote at 2010-04-05 11:21:

Thanks for writing that up, a nice read. Hope you write some more :D
