A friend of mine needed to submit a Word document with track changes for school, but she needed it to be anonymized. Word's default behavior is to attach the name listed in Word's preferences to every change made while track changes is turned on. But if you change your name in the preferences, all previous edits still remain under the old name in the document and are not editable.
With a series of command-line commands you can change all of the tracked changes in a Word document to any name you need, because the .docx format is actually just a zip file of XML documents that contain all of the Word doc's content and metadata. (It's specified here.)
Here's the code you can run in your Mac's Terminal to change all of the tracked-changes author names. It assumes the file you're editing is on your Desktop:
cd ~/Desktop
unzip myDocument.docx -d anonDocument/
grep -rl "w:author" ./anonDocument | xargs sed -i '' 's/w:author="[a-zA-Z0-9 ]*"/w:author="anonymous edit"/g'
cd anonDocument
zip -r ../cleanDocument.docx .
cd ..
rm -r anonDocument
open cleanDocument.docx
This is what that does:
- change the working directory to the Desktop
- unzip the Word doc into a new directory called anonDocument
- search all of the files in the word doc package and replace any comment or track changes author with anonymous edit
- change the working directory to the anonDocument directory
- create a new Word doc on the Desktop called cleanDocument.docx
- change the working directory back to the Desktop
- remove unzipped document folder from the desktop
- open the new document
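If you'd rather not juggle shell commands, here's a rough Python equivalent of the same pipeline; a minimal sketch that assumes the same Desktop file locations and replacement name as above, not a polished tool:

# A sketch of the shell pipeline above, in Python: rewrite every
# w:author attribute inside the .docx zip. Paths are assumptions.
import os, re, zipfile

src = os.path.expanduser("~/Desktop/myDocument.docx")
dst = os.path.expanduser("~/Desktop/cleanDocument.docx")

with zipfile.ZipFile(src) as zin:
    with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename.endswith(".xml"):
                # the same substitution the sed command makes
                data = re.sub(br'w:author="[a-zA-Z0-9 ]*"',
                              b'w:author="anonymous edit"', data)
            zout.writestr(item, data)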
I had a fear that it would be shrugged off with a whimper of notice. I was worried that people I admired in the industry would see my efforts and think of them as trivial.
I still view Chartbuilder as a simple solution to a simple problem; why shouldn't they think so too? (I also wrestle with the "I'm not a developer" complex described perfectly by Noah Veltman.)
I wrote up a piece about how the tool has made Quartz better, and the internet, or at least the little corner of the internet that I operate in, went nuts.
Turns out, lots of people in news and elsewhere have been dying for a way to easily create and export charts as images.
Here are some people who I’ve never met (well, with the exception of the one who I went to middle school with…TRIVIA!) whose work and opinions I admire and respect saying nice things about Chartbuilder:
[Embedded tweets from Daring Fireball (@daringfireball), who linked to Chartbuilder (http://t.co/77QbyTm96A), Nathan Yau (@flowingdata), Downtown Josh Brown (@ReformedBroker), Jared Keller (@jaredbkeller), Erin Sparling (@everyplace), Jeremy Bowers (@jeremybowers), Jonathan Stray (@jonathanstray), and Gabriel Dance (@gabrieldance).]
Now that it’s open source, the version of Chartbuilder today is significantly better than the version that only existed on Quartz servers last month.
Terrifying, exhilarating, and distracting. Also, incredibly fun.
A friend of mine was at a party last night. The hosts pulled her aside and took her into the bathroom.
"For the last month we’ve looked out this window in our shower and saw this big ‘No’ taped on that wall over there…we couldn’t figure out what it was or why it would be there. We were obsessed. We would try to figure it out all the time. But now it’s gone.
Today we put hastheusgoneoffthefiscalcliff.com to sleep after a month of service, so we wanted to explain how it came to be.
How did the site work?
We had a DSLR plugged into an AC power supply, on a tripod, hooked up to a Mac Mini with a USB cable:
- The Mac Mini ran a bash script every 5 minutes through the crontab (a rough sketch of such a script follows this list)
- the script triggered a camera capture through the USB cable and downloaded the image
- the script created two smaller resized copies of the image (one for the site one for social media use)
- the script uploaded those images to our web server, replacing the previous captures
- the script put a timestamp in the full sized image’s filename and moved it to an archive on the Mac Mini (for posterity)
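Our script was bash, but here's a rough sketch of that cron job, written in Python for illustration; the gphoto2 invocation, image sizes, filenames, and server path are all assumptions rather than what we actually ran:

# Sketch of the every-5-minutes capture job (assumptions noted inline).
import datetime, shutil, subprocess
from PIL import Image

CAPTURE = "imagecapture.jpg"

# trigger the DSLR over USB and download the frame (assumes gphoto2)
subprocess.check_call(["gphoto2", "--capture-image-and-download",
                       "--filename", CAPTURE, "--force-overwrite"])

# two smaller copies: one for the site, one for social media (sizes assumed)
for width, name in [(1000, "imagecapture_1000.jpg"), (640, "imagecapture_social.jpg")]:
    img = Image.open(CAPTURE)
    img.thumbnail((width, width))  # resize in place, keeping the aspect ratio
    img.save(name)

# replace the previous captures on the web server (host and path assumed)
subprocess.check_call(["scp", "imagecapture_1000.jpg", "imagecapture_social.jpg",
                       "user@webserver:/var/www/site/"])

# archive the full-size frame with a timestamp in its name, for posterity
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
shutil.copy(CAPTURE, "archive/imagecapture-%s.jpg" % stamp)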
"http://hastheusgoneoffthefiscalcliff.com/imagecapture_1000.jpg?timestamp=" + (new Date()).getTime()
This redefined the path to the image every 2.5 minutes; appended with the date stamp as a parameter to make sure we dont get a cached version of the image.
The clickable areas were defined in a Google spreadsheet that was loaded every time the page loaded and on each subsequent image replacement. We updated this document by hand every time we changed the wall.
How did the wall work?
The hard way: Every morning we got up, printed out some headlines, tweets, quotes and pictures, tiled them together and taped them to the wall.
The idea for the single serving site was Zach's from the beginning. Sometime in October he noticed that the hastheusgoneoffthefiscalcliff.com domain was available to register, and he brought it up as something maybe worth pursuing. The first idea he sketched out was this:
A man who slowly inches toward the edge of a cliff, paired with links to stories around the web about the topic and an explanation of what the fiscal cliff was. That conversation devolved into the merits of different depictions of cliffs:
After seeing Brian Rea’s coverage of the US Presidential Election Night on Instagram and remembering the website for Sagmeister and Walsh, I thought about making a webcam of a wall of stuff. I made this as a proof of concept:
Everyone got on board, and this is what resulted over the next month:
- David Yanofsky
Development time: 2 days
Imagine the human being who took the time to make this. We must deeply honor the focus of that person.
Apparently it seemed like a crazy task to filter through almost 100 years of documents and tabulate information about them. Let's assume they thought I was doing this by hand.
I wasn't. Not even the GIF. Here's how:
Making the GIF
There's a command line tool called ImageMagick that will both turn a PDF into a series of images and then turn that series of images into a GIF. These are the two ImageMagick commands I used to accomplish this:

$ for infile in *.pdf; do convert -density 400 -resize 400 -trim -extent 500x700 -gravity north $infile jpeg/$infile.jpg; done
$ convert -delay 25 -loop 0 jpeg/f1040__*-0.jpg animated1040.gif
The first line tells ImageMagick to look at every PDF in the current directory, convert it to a 400px-wide JPEG (rendering the vector data at 400ppi), trim away the surrounding whitespace, extend the edges of the image to 500px by 700px (anchoring the image to the top center of the new bounds), and save it in the folder named jpeg. The second line tells ImageMagick to merge every .jpg file in the jpeg folder (i.e., every file I just created) with a file name ending in "-0.jpg" (the first page of the former PDF) into a GIF called "animated1040.gif" that displays each image for 25 hundredths of a second and loops continuously.
After cleaning and optimizing it in Photoshop I had this.
Finding the files
All of this was dependent on having all of these 1040s. When I started looking for them, I was hoping some think tank or library would have an archive of the documents.
I decided to start simple. The current form is easy to find: a web search for "1040" returned the PDF served by the IRS as the top result. Now what about the old forms? A web search for "2010 Form 1040" also returned a PDF on the IRS website, but it had a slightly different URL: www.irs.gov/pub/irs-prior/f1040--2010.pdf. "irs-prior" (I like the look of that) and "f1040--2010.pdf". Could all of the filenames be systematized?
Yes! A couple minutes of URL manipulation in my browser revealed files following this pattern dating back to 1913 (though there were no forms for 1914 and 1915, since those years used the same 1913 form).
Downloading the docs
The next step was to download all of the files. Should I change the year in each URL and save as from my browser? TERRIBLE IDEA. I opened up my command line and used Python's interactive prompt to download all the files super quick. It went something like this:

$ python
>>> import urllib
>>> years = range(1916,2012)
>>> for y in years:
...     urllib.urlretrieve("http://www.irs.gov/pub/irs-prior/f1040--%s.pdf" % (y,), "f1040--%s.pdf" % (y,))
...
Here’s what that means:
- start python
- load the library I need to download files
- create a list of years that I want to download: start in 1916, end in 2011 (one short of 2012), and call it "years"
- cycle through every year in that list calling the current year “y”
- download the file using the url and naming system I figured out before, save the file using the same system
Two minutes later there were 96 PDFs in my folder for this project. I opened up the 1913 form in my browser and downloaded it. BOOM. Every 1040 ever.
So now we had all these files, and we had to quantify exactly how much more complex they've gotten over time. My first idea was to use the amount of ink on each document as a proxy for complexity: I wanted to count the number of black pixels in each document. Once ImageMagick had converted all the PDFs to images, I could start counting pixels.
Using a Python library called PIL, I opened up each file, converted it to grayscale, counted the number of black pixels, calculated the ratio of black to total pixels, associated that JPEG with the appropriate year, and saved that information as a JSON blob and a CSV spreadsheet. I'll save you from that code here, but you can see it on GitHub.
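The actual script is the one on GitHub, but the core pixel-counting idea is simple enough to sketch; the filename and the black/white cutoff here are assumptions:

# A minimal sketch of the ink-counting step, not the actual script.
from PIL import Image

def ink_ratio(path, cutoff=128):
    img = Image.open(path).convert("L")  # grayscale: 0 is black, 255 is white
    pixels = list(img.getdata())
    dark = sum(1 for p in pixels if p < cutoff)  # count "inked" pixels
    return float(dark) / len(pixels)             # ratio of black to total

print(ink_ratio("jpeg/f1040--2011.pdf.jpg"))  # filename follows the convert step above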
Using the CSV I got out of that, I made this chart showing the amount of “ink” on the form over time:
It was antithetical to what we knew was true. If the amount of printing were a proxy for complexity, this chart would show that the tax code is less complex now than in the first years of the system.
Were the older documents just bigger? Did they use larger type? I charted the same information as a ratio of black per page. Same story. Then I realized that the older documents have more instructions on them! (More recent 1040s include instructions in a separate appendix.) What if we just looked at the tabulation page? No luck. Apparently today's documents are more ink-efficient than those of yesteryear.
I crafted a new strategy: count the number of line items on the form. (Our methodology for what we counted is recounted in the piece.) The slow way to do this would be to double-click each file in a document viewer, count how many lines were on each page, and type that into a spreadsheet, hoping I didn't miss anything or make a typo. (It was beer o'clock in the office.)
The fast way is to write more code. I created another Python script that would open up each page of every document individually and prompt me to enter how many lines were on that page and whether I should overwrite the current number of lines I recorded or add these to the number of lines already recorded. Once complete, the script saved a spreadsheet of the recorded information.
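The actual script had a few more niceties, but the core loop looked something like this sketch; the filenames, the macOS open call, and the output format are assumptions:

# Python 2 sketch of the interactive line-counting loop (details assumed).
import csv, glob, subprocess

counts = {}  # page image -> number of line items recorded

for path in sorted(glob.glob("jpeg/f1040--*.jpg")):
    subprocess.call(["open", path])  # show the page (macOS)
    n = int(raw_input("%s: how many lines? " % path))
    mode = raw_input("[o]verwrite or [a]dd to the recorded count? ")
    if mode == "a" and path in counts:
        counts[path] += n
    else:
        counts[path] = n

# once complete, save a spreadsheet of the recorded information
with open("line_counts.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "lines"])
    for path in sorted(counts):
        writer.writerow([path, counts[path]])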
I made this chart.
Thinking about the transfer of instructions from the form to a separate document, I decided to take a look at the instructions booklet and see how those have changed over time. I used the same exact scripts as above to download all of the instruction files by changing the URL slightly to “i1040” from “f1040.” (This naming convention was also revealed by a web search for “1992 form 1040 instructions.”)
The recent documents were long: 2011 is nearly 200 pages. I used some more code to count the number of pages, and it looked like this:

$ python
>>> from pyPdf import PdfFileReader
>>> years = range(1939,2012)
>>> for y in years:
...     print y, PdfFileReader(file("i1040--%s.pdf" % (y,), "rb")).getNumPages()
...
I copied the data from the output (I didn't save it to a file, for speed's sake), pasted it into Excel, and made this chart:
Fifteen years ago, tax instructions were half the size! More strikingly, the booklets from the '80s had smaller pages but were still significantly shorter than today's.
So now I had a GIF, three charts, and a whole bunch of data. All that was left was words.
Read them all here: Line for line, US income taxes are more complex than ever
Development Time: 1 day
"Unfortunately, we have not selected your project to move forward"

Now is when I try to come up with a way to do TableTent with no money. I've been thinking about using ScraperWiki.