Here’s a simple programming task: read everything from a file into memory. To do this, you need to open the file, read the data, and stop reading when you reach the end of the file (abbreviated “EOF”).
But how do you KNOW you’ve reached the end of a file? That’s a simple question with a slightly complex/misleading answer.
The Obvious Approach
We need some way to know that we’ve read all the data in a file. Reviewing the C++ docs, you might reasonably land on this approach:
std::ifstream stream("file.bytes");
if(stream.is_open())
{
    while(!stream.eof())
    {
        char byte = stream.get();
        // do something with "byte"
    }
}
This seems totally reasonable, and if you state the above logic in plain English, it sounds correct. “While we’re not yet at the end of the file, retrieve the next byte from the file.”
However, there’s a problem: if you read every byte in the file, eof() still returns false! It is only when you try to read one more element that it starts to return true!
For example, say you have a 10-byte file that you read using the above code. After reading the 10th byte, eof() still returns false. It is only when you attempt to read the (non-existent) 11th byte that eof() starts to return true.
So, you end up executing the loop one more time than intended. When there’s nothing left to read, get() returns a special end-of-file value (typically -1) instead of a real byte. So the loop executes one more time with that (garbage) value. Hope that doesn’t cause any problems…
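To make the extra iteration visible, here is a minimal sketch that counts how many times the loop body runs. It assumes an in-memory std::istringstream standing in for the file; count_while_not_eof_iterations and iterations_for are hypothetical helper names, not part of any library.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts how many times a "while not eof" loop body runs.
std::size_t count_while_not_eof_iterations(std::istream& stream)
{
    std::size_t iterations = 0;
    while (!stream.eof())
    {
        stream.get(); // on the final pass there is nothing left; this call is what sets eofbit
        ++iterations;
    }
    return iterations;
}

// Hypothetical convenience wrapper: runs the loop over an in-memory "file".
std::size_t iterations_for(const std::string& contents)
{
    std::istringstream stream(contents);
    return count_while_not_eof_iterations(stream);
}
```

With 10 bytes of input, such as iterations_for("0123456789"), the count comes out as 11 rather than 10. Even an empty input runs the loop body once.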
Is This Always a Concern?
I’ve encountered this “while-not-eof” pattern in student assignments, coding examples, and production code. It seems to work in a lot of cases - why is that?
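One reason is that std::getline will often save you. If you’re reading full lines, std::getline sets eofbit as soon as it runs into the end of the file while reading the last line, so the loop condition turns false right on time (at least when the file doesn’t end with a trailing newline). So you wouldn’t even know there was danger afoot.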
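Here is a rough sketch of that behavior, again assuming an in-memory std::istringstream in place of a real file. count_getline_passes is a hypothetical helper that counts how many times the while-not-eof loop body runs when it reads with std::getline.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts loop passes when std::getline is used
// inside a "while not eof" loop.
std::size_t count_getline_passes(const std::string& contents)
{
    std::istringstream stream(contents);
    std::size_t passes = 0;
    std::string line;
    while (!stream.eof())
    {
        // If the last line has no trailing newline, getline hits the end of
        // the data while reading it and sets eofbit on that same call.
        std::getline(stream, line);
        ++passes;
    }
    return passes;
}
```

For the two-line input "a\nb" this makes 2 passes, as hoped. For "a\nb\n" it makes 3, the last one with an empty line, so the rescue depends on how the file ends.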
You also probably wouldn’t notice this problem if you just copy the file into a big buffer using ifstream::read. In this case, the buffer is usually known to be bigger than the file’s contents (in which case the buffer contains all the file data as expected). Or you may know exactly how many bytes to read from the file (in which case you don’t actually need to detect EOF).
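As a sketch of the big-buffer case: one call to read, then gcount() to find out how many bytes actually arrived. read_up_to is a hypothetical helper, and the capacity is assumed to be larger than the file. It takes a std::istream so the same code works with std::ifstream or an in-memory stream.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads up to `capacity` bytes in a single call and
// returns only the bytes that were actually read.
std::vector<char> read_up_to(std::istream& stream, std::size_t capacity)
{
    std::vector<char> buffer(capacity);
    stream.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));

    // gcount() reports how many bytes the last unformatted read extracted,
    // so no eof() check is needed to know where the data ends.
    buffer.resize(static_cast<std::size_t>(stream.gcount()));
    return buffer;
}
```

No loop means no loop condition to get wrong, which is why this style rarely runs into the problem.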
However, if you need to read byte data from a file until you reach the end, this issue can byte you (heh).
The Hacky Solution
A key thing to realize is that eof() is not actually doing what you want it to do. You want it to signal that you’ve read all the data in the file. But it’s actually signaling when you’ve read TOO MUCH data.
In other words, it’s an off-by-one problem. Instead of telling you when you’ve read (n) bytes, it tells you when you’ve read (n+1) bytes.
With that in mind, one option is to check eof() two times:
std::ifstream stream("file.bytes");
if(stream.is_open())
{
    while(!stream.eof())
    {
        char byte = stream.get();
        if(!stream.eof())
        {
            // do something with "byte"
        }
    }
}
This is a bit clunky, but it does stop the loop body from running an extra time with garbage data.
The True Solution
Ideally though, we’d like just the one conditional check. Fortunately, there is a way to achieve this:
std::ifstream stream("file.bytes");
if(stream.is_open())
{
    char byte;
    while(stream.get(byte))
    {
        // do something with "byte"
    }
}
This works well, though I’d argue this API is a bit too clever & obtuse for its own good:
- The get function returns a reference to the stream it was called on. In other words, the stream returns a reference to itself.
- The stream object overloads the boolean conversion operator to return true if the stream has not encountered any error. Since we check this after calling get, it tells us whether our last call to get was successful or not.
Streams store three flags to represent error states: eofbit, failbit, and badbit. If any one is set, it probably means the stream is no longer safe to read from.
The boolean conversion returns false if either failbit or badbit is set. But we need to check eofbit, don’t we!?
Fortunately, attempting to read past the end of the file sets both failbit and eofbit, so the boolean conversion does work for an end-of-file check (as well as guarding against other theoretical error states, which is nice).
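To see the flags for yourself, here is a minimal sketch that drains a stream with the get(byte) loop and then inspects the stream state. DrainResult and drain are hypothetical names used only for illustration, and an in-memory std::istringstream is assumed in place of a real file.

```cpp
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>

// Hypothetical result type: how many bytes were read and which flags ended up set.
struct DrainResult
{
    std::size_t bytes_read;
    bool eof_set;
    bool fail_set;
    bool bad_set;
};

// Hypothetical helper: reads every byte, then reports the stream's state flags.
DrainResult drain(std::istream& stream)
{
    DrainResult result{0, false, false, false};
    char byte;
    while (stream.get(byte)) // the stream converts to false once a read fails
    {
        ++result.bytes_read;
    }

    // rdstate() exposes the individual flags, unlike fail(), which also
    // reports true when only badbit is set.
    const std::ios_base::iostate state = stream.rdstate();
    result.eof_set = (state & std::ios_base::eofbit) != 0;
    result.fail_set = (state & std::ios_base::failbit) != 0;
    result.bad_set = (state & std::ios_base::badbit) != 0;
    return result;
}
```

For a 10-byte input, the loop body runs exactly 10 times, and afterwards eofbit and failbit are both set while badbit is not.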
Conclusion
As mentioned above, this all boils down to an off-by-one misunderstanding. Whereas a lot of APIs operate under the idiom “check if this thing is OK and then use it,” streams use the idiom “use it, but then check if anything went wrong.”
This behavior seems perplexing - why would they design the API like this? A post on Stack Overflow makes a good point:
While this [behavior] seems confusing for files, which typically know their size, EOF is not known until a read is attempted on some devices, such as pipes and network sockets.
So, in a way, this behavior is the result of the standard library needing to accommodate a very generalized abstraction. Streams allow reading/writing data, but the actual underlying data source/destination can vary wildly. The abstraction can have benefits (read from any source using one class) as well as drawbacks (needing to think about how pipes or network sockets work when you’re just trying to read a file).