
Set difference: data from two big files with one number per line

Today, I had to find the difference between two huge lists of numbers.
The numbers are 17 digits long and the lists are around 1 lakh (100,000) lines each.

PS: I'm documenting both versions here for my future reference.

I used Python because diff didn't feel right to me: it prints the lines that differ on both sides, not just the ones missing from one file. I also ruled out diff because, at the time, I didn't think it would work.
"""
# A shorter ugly version
total = set([i.strip() for i in open("total.txt").readlines()]) # list comprehension to remove \r\n from lines
coupon = set([i.strip() for i in open("coupon.txt").readlines()]) # set to remove duplicates and do set difference
open("result.txt", "w").write('\n'.join(sorted(total-coupon))) # use set difference and use sorted to sort then write in separate lines
"""
### Now the same thing as above, in a beautiful & readable way
# read the file contents into lists
total = open("total.txt").readlines()
coupon = open("coupon.txt").readlines()
# strip "\r\n" and/or spaces from the ends of each line
total = [i.strip() for i in total]
coupon = [i.strip() for i in coupon]
# create sets from the lists (this also removes duplicates)
total_set = set(total)
coupon_set = set(coupon)
# find the set difference
difference = total_set - coupon_set
# sort the result
sorted_difference = sorted(difference)
# write the result back into a file, one number per line
f_result = open("result.txt", "w")
f_result.write('\n'.join(sorted_difference))
f_result.close()
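
For future reference, here is a minimal sketch of the same steps wrapped in functions, with `with` blocks so the files get closed automatically. The function names (`read_lines_as_set`, `write_difference`) and the demo numbers are made up for illustration. Two choices differ from the version above and are my assumptions, not requirements: blank lines are skipped, and the result is sorted with `key=int`, which only matters if the numbers are not all the same length.

```python
import tempfile
from pathlib import Path


def read_lines_as_set(path):
    """Return the distinct non-empty lines of a file, stripped of whitespace."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def write_difference(total_path, coupon_path, result_path):
    """Write the lines found in total_path but not in coupon_path, sorted."""
    difference = read_lines_as_set(total_path) - read_lines_as_set(coupon_path)
    # key=int sorts numerically even when the numbers differ in length
    ordered = sorted(difference, key=int)
    with open(result_path, "w") as handle:
        handle.write("\n".join(ordered))
    return ordered


# Small demo with made-up numbers in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    total_file = Path(tmp, "total.txt")
    coupon_file = Path(tmp, "coupon.txt")
    result_file = Path(tmp, "result.txt")
    total_file.write_text("30\r\n4\n10\n10\n")
    coupon_file.write_text("10\n")
    demo_result = write_difference(total_file, coupon_file, result_file)
    demo_written = result_file.read_text()
```

With the real files, the call would simply be `write_difference("total.txt", "coupon.txt", "result.txt")`.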

Later, at home, I gave diff a try.
It's not as beautiful as the Python version, and I had to use cut and sed a bit.
But still, it's a "one-liner" and I like 'em a lot.
$ ## First see the one liner
$ diff -bBw total.txt coupon.txt | grep '<' | cut -d'<' -f2 | sort -nu | sed -e 's/^[ \t]*//' > result.txt
$ ## Now dissect it
$ diff -bBw total.txt coupon.txt # gives the full diff, with changes from both files
$ diff -bBw total.txt coupon.txt | grep '<' # keep only '<' lines: numbers in total.txt that diff did not match in coupon.txt
$ diff -bBw total.txt coupon.txt | grep '<' | cut -d'<' -f2 # remove leading '<' printed by diff
$ diff -bBw total.txt coupon.txt | grep '<' | cut -d'<' -f2 | sort -nu # numerically sort & remove duplicates
$ diff -bBw total.txt coupon.txt | grep '<' | cut -d'<' -f2 | sort -nu | sed -e 's/^[ \t]*//' # remove leading space
$ diff -bBw total.txt coupon.txt | grep '<' | cut -d'<' -f2 | sort -nu | sed -e 's/^[ \t]*//' > result.txt # write result into a file
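
One caveat worth noting for future me: diff compares lines in order, so the pipeline above matches a true set difference only when the shared numbers appear in the same relative order in both files (for example, when both are sorted first). The sketch below illustrates this with Python's `difflib.SequenceMatcher`. That is not GNU diff and may pick different lines, but it is order-sensitive in the same way. The sample data is made up.

```python
from difflib import SequenceMatcher

# Made-up sample data: the shared numbers appear in a different order
total = ["1", "2", "3"]
coupon = ["3", "1"]

# Set difference ignores order entirely
set_diff = set(total) - set(coupon)

# Lines that an order-sensitive, line-based comparison marks as removed from `total`
matcher = SequenceMatcher(a=total, b=coupon, autojunk=False)
diff_removed = set()
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag in ("delete", "replace"):
        diff_removed.update(total[i1:i2])

print("set difference:   ", sorted(set_diff))
print("removed per diff: ", sorted(diff_removed))
```

Here the line-based comparison reports more "removed" lines than the set difference contains, because "1" and "3" cannot both be matched when their order is reversed.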


