Facebook preserves some of your metadata; Adds mysterious strings
A nefarious metadata plot has been unearthed and news of it is streaking around the interwebs. An Australian law student named Edin Jusupovic was casually looking at photos downloaded from Facebook in a hex editor and tweeted about what he saw.
And from there, as they say, the rest was history. When last I looked, Jusupovic’s tweet had been retweeted 16,637 times, there were nearly 2,000 mostly clueless replies to his thread, and no less a journalistic standard-bearer than Forbes had weighed in: “Facebook Embeds ‘Hidden Codes’ To Track Who Sees And Shares Your Photos”, cried their headline.
(Regular readers may note that we now know of two individuals geeky enough to casually peruse hex dumps of photos, the other being, well, me. Looking at a photo in a hex editor kind of reminds me of seeing an automobile through a microscope. Yes, the car is there, sure enough. But just glancing out the window and seeing the thing sitting on the driveway is a better way to see it.)
Hmmm. What’s this?
What Jusupovic came to “discover” was that Facebook had written a string of characters in the IPTC Special Instructions field on this photo and, as far as I know, on every other photo on their site. What might be “structurally abnormal” about that, I haven’t a clue. But on with the story…
“The take from this is that they can potentially track photos outside of their own platform with a disturbing level of precision about who originally uploaded the photo (and much more),” Jusupovic elaborated in a further tweet.
So, I presume we are to infer that Facebook is a bunch of shifty characters who do stuff behind our backs, and that metadata is very scary. The former is pretty obvious by now to anyone with an internet account and the latter is probably not a notion that readers of this blog especially support.
Let’s unravel this and see what we do and do not know. (Yes, a Pulp Fiction reference.)
UPDATE – The IPTC’s Metadata Working Group has weighed in on this story. Their post can be seen here: https://iptc.org/news/what-does-facebook-do-with-your-photo-metadata/ Note that David Riecks’s work, quoted in the IPTC article, indicates that Facebook does preserve creator and copyright information in the Exif data block. I missed this. Mea culpa. Below, I give Facebook some credit for their treatment of metadata. This means they deserve a bit more credit still.
First off, while Jusupovic’s discovery came about whilst poring over the strings view window of his hex editor, which is looking rather more deeply into a photo than most of us might, we can also see the Special Instructions field by clicking on “Get Info” on a photo on a Mac desktop. Not exactly NSA-level sleuthing.
Hiding in plain sight?
That’s because IPTC metadata is a universal standard. It’s public. It’s where you put stuff that you want everybody to be able to see and read. To a greater or lesser extent, both popular consumer operating systems display IPTC metadata at a right-click. Any competent graphics program – about half of them, according to the IPTC – will display a fairly comprehensive view of IPTC fields. Those metadata fields would amount to an awfully in-your-face place for “Hidden Codes”. Whatever Facebook is doing, it’s doing publicly, which is oddly transparent for an outfit that everybody, including me, likes to demonize for underhanded dealings.
Then there is the matter of a “Shocking level of tracking…” and, as Jusupovic went on to tweet, “…potentially track[ing] photos outside of their own platform with a disturbing level of precision about who originally uploaded the photo (and much more).” Before we look into what actually happens to photos that roundtrip through Facebook, let’s unpack this.
Let’s bear in mind that Facebook owns its database. In it, they have tucked away every little salacious detail about you that they think might ever be worth a penny or two. They know what you like and don’t like, who you like and don’t like, what your politics are, and, of course, they have a record of everything you’ve ever said on the platform. They have an archive of every image you’ve ever looked at, never mind uploaded.
So, a “Shocking level of tracking”? Absolutely. But does a public-facing ID number on a photograph significantly add to the damage? I rather doubt it. Let’s face it, on Facebook itself, the company knows everything about you and they have no reason to record any of that valuable data where you or a competitor might see it.
Off the reservation…
Off the Facebook platform is another story. Conceivably, someone could download a photo from Facebook and upload it to some other website. (That’s copyright infringement, by the way. Not that anyone except people who used to make a living at photography would care.) In such a case, Facebook could conceivably crawl the web and see that “their” picture had found its way off the reservation. If such a picture were to be re-uploaded to Facebook, Facebook would be able to tell where it came from. And from whom. Right?
Sort of. Not really. As regular readers are all too sadly aware, the imaging libraries on most websites strip all metadata off photos. This blog constantly pleads for website operators to do the community-minded thing and configure their sites to honor metadata. But, despite my efforts, according to a study by Imatag, 85% of photos on the web carry no metadata at all, and 97% don’t identify their copyright owner.
If Facebook is trying to track photos off their site, this would be a lousy way to do it. Yes, given their scale, maybe they could glean some data about where photos that leave their site go generally. Er, check that. They can’t. The skewed data would indicate that photos stolen from Facebook always end up on websites that don’t strip metadata, so that won’t work. As a way to surveil individuals, this is a no-go.
Maybe another way
Steganography, embedding hidden data directly in an image’s pixels, would be a better bet. And ever so much sneakier.
Not that Facebook would actually need to employ steganography.
We know that Facebook uses artificial intelligence image recognition to understand both who is in any given photo and what is happening in that photo. Likewise, AI can identify the photo itself, thus obviating any need for tricky invisible watermarking. Think Google Images reverse search employed on a grand scale by the dark side. No need for steganography, then.
On the other hand
The irony here is that as far as metadata is concerned, Facebook is the least pernicious of the social media platforms. Facebook respects at least some of your copyright management metadata. As far as I know, they are the only social media platform to do so.
A trip to Facebook
In order to look at what Facebook does with metadata, I sent a picture on a round trip journey to and from the platform. I took a picture of an ibis and pasted onto it the entire metadata contents of the IPTC test image. Every single IPTC field was filled with a descriptive string.
This is what the picture looked like upon its return:
Here is all the IPTC metadata from the image in one place, as seen in ExifTool output. You’ll note that there are eight fields represented and that they are only recorded in the IIM data block. This last is clever from a bandwidth perspective. IIM data is very efficient indeed. XMP is much bulkier. Almost all software can read IIM, so the lack of XMP doesn’t do much harm. Field names shown are the IIM names, which differ in ExifTool from the XMP ones.
My bird picture came back resampled from a hefty 5.8 MB to an emaciated 82 kilobytes. Facebook stripped the entire Exif data block, with its potentially troublesome geotagging data and bandwidth-wasting thumbnails. As they did with the XMP data block.
What was left
Remaining, in the IIM only, were eight IPTC fields. Six of those, Creator, Copyright, Creator’s Job Title, Credit, and weirdly, Source and Description Writer, were preserved as they were in the original. Two, Transmission Reference and Special Instructions, were overwritten with strings from Facebook. Content in all other fields was deleted.
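For readers who would rather not squint at a hex editor, here is a rough Python sketch of how those eight fields can be read out of a raw IIM block. It relies on the IIM layout as I understand it: each dataset starts with the byte 0x1C, followed by a record number, a dataset number, and a two-byte big-endian length. The sample values are made up for illustration, decoding the text as UTF-8 is an assumption, and the sketch does not unwrap the Photoshop segment that carries the IIM block inside a real JPEG. A tool like ExifTool handles all of that properly.

```python
import struct
from typing import Dict, Iterator, Tuple

# IIM Application Record (record 2) dataset numbers for the eight fields
# discussed in the post.
IIM_NAMES = {
    (2, 40): "Special Instructions",
    (2, 80): "By-line (Creator)",
    (2, 85): "By-line Title (Creator's Job Title)",
    (2, 103): "Original Transmission Reference",
    (2, 110): "Credit",
    (2, 115): "Source",
    (2, 116): "Copyright Notice",
    (2, 122): "Writer/Editor (Description Writer)",
}


def iter_iim(block: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (record, dataset, value) for each standard dataset in a raw IIM block."""
    pos = 0
    while pos + 5 <= len(block):
        marker, record, dataset, length = struct.unpack_from(">BBBH", block, pos)
        # Stop on anything that is not a standard-length dataset.
        if marker != 0x1C or length & 0x8000:
            break
        start = pos + 5
        end = start + length
        if end > len(block):
            break  # truncated value; give up rather than guess
        yield record, dataset, block[start:end]
        pos = end


def read_fields(block: bytes) -> Dict[str, str]:
    """Map field names to decoded text (UTF-8 is assumed, not guaranteed)."""
    fields = {}
    for record, dataset, value in iter_iim(block):
        name = IIM_NAMES.get((record, dataset), f"{record}:{dataset}")
        fields[name] = value.decode("utf-8", errors="replace")
    return fields


def make_dataset(record: int, dataset: int, text: str) -> bytes:
    """Build one dataset, used here only to create hypothetical sample data."""
    data = text.encode("utf-8")
    return struct.pack(">BBBH", 0x1C, record, dataset, len(data)) + data


# Hypothetical sample block; these are not values taken from a real Facebook photo.
sample = (
    make_dataset(2, 80, "Jane Doe")
    + make_dataset(2, 116, "(c) Jane Doe")
    + make_dataset(2, 40, "example-string")
)

for name, text in read_fields(sample).items():
    print(f"{name}: {text}")
```

Looking fields up by record and dataset number, rather than by display name, sidesteps the naming differences between tools.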
The ICC profile was preserved, too. (One day, I’ll do a post about ICC profiles. Until then, it’s pretty safe not to worry about them.)
So, we should (relatively) commend Facebook for not destroying all the copyright management information in my picture.
They left enough intact that someone of benign intent downloading a photo should be able to figure out who owns it and contact that person for permission to use it. Assuming that the photographer bothered to write their name on their image in the first place, that is.
Then there’s the matter of those strings they wrote. That’s where all the fun is.
Starting with the string in Transref.
Every content management system (they run websites, like this one) and digital asset management system (they store pictures) I’ve ever encountered assigns a unique ID number to each media file in its library. DAMs commonly write that unique ID to IPTC metadata, just in case humans might need it one day. That’s in addition, of course, to keeping it in the database, where it’s needed for the system to work. Usually, such a system writes the ID number to the Transref, because that’s the sort of thing that field was designed for.
(More or less. The original purpose of the Transref field was to hold the Transmission Reference, which is a string that identifies a wirephoto by a code indicating where it came from and an arbitrary sequence number. Nowadays we use it for job ticket numbers and all sorts of things.)
Facebook puts a very random-looking string of (in my sample) 20 alpha-numeric characters in the Transref. Twenty alpha-numeric characters could represent a number in the hundreds of nonillions. A nonillion is 10^30. That’s a biiiig number. A trillion is a paltry 10^12. The Transref field holds 32 characters in the IIM version. Even just using 20 of those, Facebook would have available plenty of unique numbers for each of its photos for the next little while.
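For the arithmetically inclined, the size of that number space is easy to check. This is only a back-of-the-envelope sketch: the post does not establish which characters Facebook actually allows, so both alphabets below (36 case-insensitive characters and 62 case-sensitive ones) are assumptions.

```python
import string

LENGTH = 20          # characters seen in the sample Transref string
NONILLION = 10**30


def keyspace(alphabet_size: int, length: int = LENGTH) -> int:
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length


# Assumption: letters of one case plus digits (36 symbols).
case_insensitive = keyspace(len(string.ascii_uppercase + string.digits))
# Assumption: upper- and lower-case letters plus digits (62 symbols).
case_sensitive = keyspace(len(string.ascii_letters + string.digits))

for label, count in (
    ("36-symbol alphabet", case_insensitive),
    ("62-symbol alphabet", case_sensitive),
):
    print(f"{label}: roughly {count / NONILLION:,.0f} nonillion possible IDs")
```

Whichever alphabet is in play, the count lands well past a nonillion, which is the point: running out of IDs is not a concern.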
(The IPTC’s own investigation of this matter revealed that the string in Transmission Reference does not change if the downloaded image is uploaded back to Facebook and re-downloaded. This further suggests that the Transref string is indeed a unique ID. Remember that I’m speculating here. Facebook could do us all a service by simply telling us what they’re doing.)
We’re all unique
While we are thinking about unique identifiers, we should consider that the filename on my sample image is 45 digits. Just numbers and underscores, but still, 45 figures makes a pretty good size number. It’s quite possible that the filename is also a unique identifier. While we really have no idea, we can surmise without being too silly that either or both of the filename and the string in the Transref might be a unique ID.
The string in Special Instructions is 99 characters long. We don’t know what it is, but it’s pretty clear it wouldn’t fit in Transmission Reference. If I remember correctly, in the IIM, Special Instructions can hold 256 characters.
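The size argument can be spelled out in a few lines of Python. The two limits are the IIM field lengths mentioned in this post (32 for Transmission Reference, 256 for Special Instructions), and the strings are placeholders of the right length, not the real Facebook values.

```python
# Maximum field lengths in the IIM, as cited in the post.
IIM_MAX_LEN = {
    "Original Transmission Reference": 32,
    "Special Instructions": 256,
}


def fits(field: str, value: str) -> bool:
    """True if the encoded value is within the field's IIM length limit."""
    return len(value.encode("utf-8")) <= IIM_MAX_LEN[field]


# Placeholders only: 99 and 20 characters, matching the lengths in the sample.
long_string = "0" * 99
short_string = "A" * 20

print(fits("Special Instructions", long_string))             # 99 <= 256
print(fits("Original Transmission Reference", long_string))  # 99 > 32
print(fits("Original Transmission Reference", short_string)) # 20 <= 32
```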
Internet gadflies are twisting themselves into a conspiracy theory pretzel trying to parse what, if any, data might be carried in the long Special Instructions string. Come to think of it, that’s a pretty interesting question.
If the information is for Facebook’s use, why put it on the photo? Anything that uniquely identifies the photo would refer back to the appropriate place in the database, where we can assume all the relevant information about your entire life and being is stored. If the information is for our use, what it is and how to use it must have been in a memo that I missed.
Personally, given the many series of repeating and apparently place-holding zeros in that string, I’m betting that it does carry data. I would certainly be interested in knowing what kind, for whom, and generally, what the heck.
So, what do we know here?
The top line is that, of all the social media platforms, Facebook shows the most respect for the information you attach to your work. Does that amount to the simple-to-say, and almost as simple-to-do, “respect embedded metadata”? No. I can’t think of an excuse for removing a picture’s caption, and whatever their reasoning for removing the remaining IIM fields was, we can safely criticize it for being pretty thin. Seriously, most of those fields are usually blank. If they hold values, it’s a safe assumption that the photographer put them there for a reason. I’m not terribly comfortable with removing the XMP data block in the long term, but I’d be a hypocrite not to admit that I’ve advised clients to do the same thing in the here and now to save some bandwidth. (And at Facebook’s scale, the amount of storage space involved is not insignificant.)
Overall, we should commend Facebook for respecting enough metadata to allow Google Images “Image Credits” to work and to give an honest person a fair shot at figuring out who owns a given Facebook-sourced image.
I seriously doubt the strings are for surveillance. The etiquette of surveilling requires doing it with some degree of stealth. The government wouldn’t follow you around in marked police cars and a blimp. That would be rude. They would use black helicopters and shifty-looking guys in dark sunglasses. After all, politeness counts.
Is there a black van parked in front of your house? Have you decoded the 99-digit mystery number? Jump in the comments and say something we haven’t read on Twitter.