TechCrunch ran a piece on Sunday describing how startups are using Hadoop, so I thought I would share our experiences with both Hadoop and Cassandra. Until 9 months ago ShopSavvy was strictly a MySQL shop. Pricenark (our backend system) ran exclusively on MySQL and stored only the bare minimum of information we needed to return prices to our users. Even though we were delivering tens of millions of retailer pages with ShopSavvy, the amount of data we kept from each query was minimal. To maximize ShopSavvy's capabilities going forward, we needed to make the most of every drop of our data. We needed a storage system that could keep long histories of offers, prices, product information, scan information, geolocation data, and merchant data while maintaining 100% uptime in a very write-heavy environment. On top of simple storage, we needed a way to run algorithms over the expanse of data we already owned in log files and MySQL, as well as all of the historical data we would soon be storing.
Looking at our requirements for a data processing platform, we quickly settled on Hadoop for our bulk processing. Before storing any new data, we needed to work through the log files that ShopSavvy had been accumulating for nearly 2 years. These files amounted to nearly 500 million lines of logs covering our users' scans, searches, licensee app usage, and very valuable geolocation data. It was clear that Hadoop would be a great fit for processing this kind of data: it specializes in churning through massive amounts of loosely organized data on a distributed cluster of commodity hardware, and it has a great track record with log files.
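To give a feel for the kind of job this is, here is a minimal map/reduce sketch in the style of Hadoop Streaming. The log format, field names, and barcode values are hypothetical, not our actual log schema; a real Streaming job would split the mapper and reducer into separate scripts reading stdin, but the contract is the same: the mapper emits key/value pairs, and the reducer receives them grouped by key.

```python
from itertools import groupby

# Hypothetical log line: "2010-07-18T12:01:02 scan 0123456789012 33.07,-96.80"
def map_lines(lines):
    """Mapper: emit (barcode, 1) for every scan event in the raw logs."""
    for line in lines:
        parts = line.strip().split()
        if len(parts) >= 3 and parts[1] == "scan":
            yield parts[2], 1

def reduce_pairs(pairs):
    """Reducer: sum counts per barcode. Hadoop delivers reducer input
    sorted by key, which is what groupby relies on here."""
    for barcode, group in groupby(pairs, key=lambda kv: kv[0]):
        yield barcode, sum(count for _, count in group)

if __name__ == "__main__":
    logs = [
        "2010-07-18T12:01:02 scan 0123456789012 33.07,-96.80",
        "2010-07-18T12:05:40 search ipod",
        "2010-07-18T12:06:13 scan 0123456789012 32.78,-96.80",
    ]
    # sorted() stands in for Hadoop's shuffle/sort phase.
    print(dict(reduce_pairs(sorted(map_lines(logs)))))  # {'0123456789012': 2}
```

The same shape scales from this toy example to hundreds of millions of log lines, because the mapper is stateless per line and the reducer only ever sees one key's group at a time.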
Being a startup, we don't have the budget for top-of-the-line machines with lots of cores, tons of drives, and piles of RAM; but we do have the budget for lots of less powerful machines. Hadoop lets us get the most out of that budget: we can string together slightly older server hardware to process all of our data in a scalable fashion, squeezing value out of every piece of hardware. And since Hadoop clusters grow easily, we can simply add more machines whenever the budget allows!
We also quickly settled on Cassandra for our storage technology. Looking at our options in the space, Cassandra was easily the most stable and mature project for storing massive amounts of data and retrieving it quickly. It offers high write throughput, data replication, elasticity, a multi-language API, and a great track record with large datasets.
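A rough sketch of how data like our price history fits Cassandra's model: one wide row per product, one column per observation, with columns kept in sorted order so time-range reads are cheap. The dict below is just an in-memory stand-in for a column family, and the schema, names, and values are illustrative, not our production model.

```python
from collections import defaultdict
from datetime import date

# Stand-in for a Cassandra column family: row key -> sorted columns.
price_history = defaultdict(dict)

def record_price(product_id, day, retailer, price):
    """Write one price observation as a column on the product's row.
    In Cassandra the comparator keeps columns ordered; here tuple
    ordering on (date, retailer) plays that role."""
    price_history[product_id][(day.isoformat(), retailer)] = price

def price_slice(product_id, start, end):
    """Read observations between two dates, like a column slice query."""
    row = price_history[product_id]
    return sorted((k, v) for k, v in row.items()
                  if start.isoformat() <= k[0] <= end.isoformat())
```

The point of the shape is that writes are append-like (great for write-heavy traffic) and a history query is a single contiguous slice of one row rather than a scan.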
With Cassandra we don’t have to worry about losing data when a machine fails, or running out of capacity when we store more data than expected; we can simply add more machines or repair the ones that go down. This is a huge advantage for a small team like ours. We don’t have a dedicated ops team to manage all of our servers, and we can’t afford any downtime in an environment where you should be constantly prepared to show your product.
Cassandra also integrates well with Hadoop. We can quickly pull data from any time period, at any size, into Hadoop to get analytical insight into our business. We schedule hourly, daily, weekly, and monthly jobs that tell us where the business is heading and what our users are doing. Cassandra gives us dependable, concise access to all of our data.
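The scheduled jobs mostly boil down to rollups: take raw timestamped events pulled from Cassandra and bucket them by reporting period. A tiny sketch of the daily case, with made-up timestamps standing in for real scan events:

```python
from collections import Counter
from datetime import datetime

def daily_scan_counts(scan_timestamps):
    """Roll raw scan timestamps up into per-day totals, the general
    shape of output a scheduled daily reporting job produces."""
    return Counter(ts.date().isoformat() for ts in scan_timestamps)

if __name__ == "__main__":
    scans = [
        datetime(2010, 7, 18, 9, 0),
        datetime(2010, 7, 18, 21, 30),
        datetime(2010, 7, 19, 8, 0),
    ]
    print(daily_scan_counts(scans))  # Counter({'2010-07-18': 2, '2010-07-19': 1})
```

Swap the bucketing key (hour, ISO week, month) and the same job covers the other report cadences.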
By storing every little piece of information in Cassandra and processing it with Hadoop, we are able to provide a lot more value to our users and quickly develop new products. In a short amount of time we have built capabilities such as keyword search, image matching, related products, geographic analysis, automated daily reporting, price-history analysis and charting, user-history tools for customer service, investor charts and reports, device syncing, non-real-time data scraping, top scanned products, and trend analysis.