Copying Strings In C
In C, strings are just arrays of chars, and each char is one byte. This means that, in memory, a string occupies a contiguous block of addresses. Because of this, you can’t assign one array to another in a single operation the way you can reassign a string variable in a language like Python:
string_1 = "Hello"
string_2 = "Good Morning"
string_1 = string_2
Now, in C, we can assign integers to each other so something like this is fine:
int x = 0;
int y = 5;
x = y;
So, why can’t we do the same thing with strings? Well, to answer that, we have to look at what’s happening under the hood. Since C sits at a much lower level of abstraction than Python, it maps fairly directly to assembly. So what would the above look like in assembly? Of course, the actual assembly would vary by architecture and compiler settings, but let’s assume x86-64 with no optimization.
mov DWORD PTR [rbp-4], 0 ; store the value 0 at the first memory address (x)
mov DWORD PTR [rbp-8], 5 ; store the value 5 at the second memory address (y)
mov eax, DWORD PTR [rbp-8] ; load the contents of the second memory address into a register
mov DWORD PTR [rbp-4], eax ; store the contents of that register at the first memory address
The main thing we want to pay attention to here is that, even though assigning y to x is one line of C code, it actually takes two assembly instructions. First, the contents of the second memory address (in this case 5) are moved into a register. Then, the contents of that register are moved into the first memory address. The x86-64 mov instruction can’t take two memory operands, so we need that intermediate step.
But anyway, let’s step back from the assembly code and think about how a naïve implementation of copying a byte array would have to work. At the low level, we’d have to move each value in one sequence of memory over to another sequence of memory. The thing about strings in C is that the null terminator marks the end of a string, and it always has to be there. The null terminator, written '\0', is just a byte of all zeros. So, although a char pointer doesn’t carry the length of the string with it, we can iterate through each value after the initial pointer and keep going until we hit that null terminator. That’s why you can’t just assign a string to another string: there’s a loop involved. Personally, I like the fact that there’s an actual function you have to call for copying strings, because it makes you think about what’s actually going on in memory. So a simplified strcpy looks something like this:
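#include <stddef.h>

// Named my_strcpy so it doesn't clash with the real strcpy in <string.h>
char *my_strcpy(char dest[], const char source[]) {
    size_t i = 0;
    while (1) {
        dest[i] = source[i];      // copy one byte, terminator included
        if (dest[i] == '\0') {
            break;                // stop once the terminator has been copied
        }
        i++;
    }
    return dest;                  // the real strcpy also returns the destination
}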
This iterates through the source char array and copies each byte to the same index in the destination, then exits the loop once it has copied the null terminator.
Caution!!! This is why strcpy is so dangerous! If the source string is longer than the destination buffer, the function doesn’t care. It just keeps copying until the whole source has been written.
Writing past the end of the destination causes undefined behavior, because you’ll potentially be overwriting memory where other values are already stored. That’s why you actually shouldn’t ever use strcpy. Just use strlcpy. It does bounds checking for you and truncates what’s copied into the destination string. It takes a “size” parameter, which is the full size of the destination buffer, and it writes at most size - 1 characters followed by a null terminator. It returns the length of the string it tried to create, which is the length of the source. So if that return value is less than the size parameter, you know the whole thing was copied successfully, and if it’s greater than or equal to the size parameter, you know the source string was truncated, and by how much: return value - (size - 1) characters were dropped.
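To make the size and return-value rules concrete, here is a minimal sketch that imitates the behavior the BSD manual pages describe for strlcpy. The names demo_strlcpy and was_truncated are made up for this post so the code compiles on systems without strlcpy. It is an illustration of the semantics, not the library's actual implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in that imitates strlcpy's documented behavior:
   copies at most size - 1 characters, always null-terminates when
   size > 0, and returns the length of src. */
size_t demo_strlcpy(char *dest, const char *src, size_t size) {
    size_t src_len = strlen(src);
    if (size > 0) {
        /* Number of characters that fit, leaving room for the terminator. */
        size_t n = (src_len < size - 1) ? src_len : size - 1;
        memcpy(dest, src, n);
        dest[n] = '\0';
    }
    return src_len;
}

/* Hypothetical helper: 1 if the source did not fit, 0 if it did. */
int was_truncated(size_t returned, size_t size) {
    return returned >= size;
}
```

With a 6-byte destination and "Good Morning" as the source, this writes "Good " plus the terminator and returns 12, so the caller can tell that 12 - 5 = 7 characters were dropped.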
The only downside is that strlcpy isn’t part of the ISO C standard library. It’s available on the BSDs, macOS, Solaris, Android, and on Linux via libbsd (or glibc 2.38 and later). So if you don’t have access, you can fall back to strncpy. But strncpy won’t add a null terminator if the source is n characters or longer, so you have to add it manually; otherwise, using the result as a string is undefined behavior. Just don’t use strcpy, because it doesn’t do any kind of bounds checking and can cause major issues.
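The strncpy fallback can be sketched like this. The helper name copy_truncating is hypothetical, and the sketch assumes the caller passes the real size of the destination buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into a dest buffer of dest_size bytes
   using strncpy, and guarantee the result is null-terminated. */
void copy_truncating(char *dest, size_t dest_size, const char *src) {
    if (dest_size == 0) {
        return;                          /* no room for even a terminator */
    }
    strncpy(dest, src, dest_size - 1);   /* may leave dest unterminated */
    dest[dest_size - 1] = '\0';          /* so terminate it manually */
}
```

Passing dest_size - 1 to strncpy reserves the last byte, and the explicit assignment covers the case where the source filled everything before it.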
It’s also worth mentioning that you can do this though:
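const char *string_1 = "Hello"; // const: string literals must not be modified
const char *string_2 = "Good Morning";
string_1 = string_2; // copies the address only, not the characters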
What's happening here is that you're declaring the two strings as pointer variables, each pointing to a block of memory that holds a string literal. So you can just assign the value of one pointer to the other. This is called a shallow copy. You're not actually copying the memory block, just the address that's being pointed to. After the assignment, string_1 reads the same as string_2 because both reference the same block of memory, so if that block were writable, a change made through one pointer would be visible through the other. Keep in mind that string literals themselves must not be modified, and that pointing string_2 somewhere else later does not change string_1.
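Here is a small sketch of that sharing. It uses a char array instead of a string literal so the bytes are legal to modify. The function name aliases_share_storage is made up for illustration.

```c
#include <string.h>

/* Hypothetical demo: returns 1 if a write made through one pointer
   is visible through the other pointer. */
int aliases_share_storage(void) {
    char buffer[] = "Hello";   /* writable array, unlike a string literal */
    char *first = buffer;
    char *second = first;      /* shallow copy: only the address is copied */

    second[0] = 'J';           /* write through the second pointer */

    /* The first pointer sees the change because both point at buffer. */
    return strcmp(first, "Jello") == 0;
}
```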
This is fine when the pointers refer to string literals or to arrays on the stack, because nothing needs to be freed: literals live for the whole program, and stack memory is reclaimed when the function returns. But be careful, because if the blocks are on the heap, you would cause a memory leak. The block of memory that string_1 used to point to becomes unreachable, like this:
char *string_1 = malloc(6); // allocate 6 bytes on the heap
strcpy(string_1, "Hello");
char *string_2 = malloc(13);
strcpy(string_2, "Good Morning");
string_1 = string_2; // shallow copy — string_1 now points to string_2's block
// the original 6 byte "Hello" block is now unreachable
// nothing points to it, but it was never freed — leak
So you need to call free() on string_1 before pointing it at the new block. After that, both pointers refer to the same block, so free it only once when you're done with it:
free(string_1); // free the old block first
string_1 = string_2; // then reassign
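Putting the pieces together, a complete version might look like the sketch below. The function name reassign_without_leak is hypothetical, and memcpy with exact sizes stands in for strcpy here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical demo: returns 1 on success, -1 if an allocation fails. */
int reassign_without_leak(void) {
    char *string_1 = malloc(sizeof "Hello");          /* 6 bytes */
    char *string_2 = malloc(sizeof "Good Morning");   /* 13 bytes */
    if (string_1 == NULL || string_2 == NULL) {
        free(string_1);        /* free(NULL) is a safe no-op */
        free(string_2);
        return -1;
    }
    memcpy(string_1, "Hello", sizeof "Hello");
    memcpy(string_2, "Good Morning", sizeof "Good Morning");

    free(string_1);            /* release the old block first */
    string_1 = string_2;       /* then share string_2's block */

    int ok = strcmp(string_1, "Good Morning") == 0;

    free(string_2);            /* one block left, so exactly one free */
    return ok;
}
```

Calling free on both string_1 and string_2 at the end would free the same block twice, which is its own kind of undefined behavior.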
The last thing I think I should mention is that the string copy functions in a real C library don't necessarily copy one byte at a time. An optimized implementation can be 'smart' enough to copy 8 bytes at a time (for example, on a 64-bit architecture) and check whether each of those chunks contains a zero byte (the null terminator for the string). If one does, it goes through that last chunk byte by byte to find out where the string actually ends.
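As a rough illustration of the chunked idea, here is a sketch built on a well-known bit trick for spotting a zero byte inside a 64-bit word. Real library implementations vary by platform and are usually hand-tuned, so treat this as one possible approach. The names has_zero_byte and chunked_copy are made up, and the sketch assumes both buffers are at least cap bytes long and that the source has a terminator within cap bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if any of the 8 bytes in word is zero. */
int has_zero_byte(uint64_t word) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    return ((word - ones) & ~word & highs) != 0;
}

/* Hypothetical chunked copy. Assumes dest and src are both at least
   cap bytes, and that src holds a terminator within cap bytes.
   Returns the length of the copied string. */
size_t chunked_copy(char *dest, const char *src, size_t cap) {
    size_t i = 0;

    /* Copy 8 bytes at a time while the chunk has no terminator. */
    while (i + 8 <= cap) {
        uint64_t word;
        memcpy(&word, src + i, 8);     /* load without alignment issues */
        if (has_zero_byte(word)) {
            break;                     /* the string ends in this chunk */
        }
        memcpy(dest + i, &word, 8);
        i += 8;
    }

    /* Finish byte by byte to find exactly where the string ends. */
    while (i < cap) {
        dest[i] = src[i];
        if (src[i] == '\0') {
            return i;
        }
        i++;
    }
    return i;   /* no terminator within cap: the assumption was violated */
}
```

The cap parameter is there because reading a full 8-byte chunk can touch bytes past the terminator, so the sketch only does that inside a buffer it knows is large enough.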
So yeah, that's basically how strings are copied in C!