Dec 24, 2011
 

I have been spending some of my free time trying to build a complete cricket statistics database by parsing records from Cricinfo. However, scraping HTML pages is an arduous task. There is simply no standard way of doing it, and it often becomes a struggle with regular expressions. A good solution to this problem is the Html Agility Pack. It's a library that standardizes the parsing of HTML pages and converts them into an XML-style DOM object that you can extract data from. It also has a good number of options for error checking (for HTML that is not XHTML compliant).

The API is very similar to the XmlDocument class in the System.Xml namespace, so there is hardly any learning curve. You can search for nodes using the XPath expression of the element you want. Getting the XPath can be a bit tricky, so an easier way is to use a Chrome extension called XPath Helper. Once the extension is installed, press Ctrl+Shift+X to activate it, then hold Shift to get the XPath of whichever element the mouse is hovering over. The given XPath can then be easily tailored to match the whole set of data we need to extract.
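
To make the idea of tailoring an XPath a little more concrete, here is a minimal sketch. It uses the built-in XmlDocument class rather than the Html Agility Pack, purely so that it runs without any extra reference, and it works on a tiny, well-formed table that I made up for illustration (the markup, player names and numbers are all hypothetical, as are the class and method names). An XPath pointing at one row is loosened so that it matches every data row, and the cells are then read relative to each row.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathSketch
{
    // Hypothetical, well-formed sample markup; not taken from a real page.
    public const string SampleTable =
        "<table><tbody>" +
        "<tr class='head'><td>Player</td><td>Runs</td></tr>" +
        "<tr class='data1'><td>Player A</td><td>120</td></tr>" +
        "<tr class='data1'><td>Player B</td><td>45</td></tr>" +
        "</tbody></table>";

    // Returns the cell texts of every row matched by rowXPath, one array per row.
    public static List<string[]> ReadRows(string xml, string rowXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var rows = new List<string[]>();
        foreach (XmlNode row in doc.SelectNodes(rowXPath))
        {
            // "td" is evaluated relative to the current row.
            XmlNodeList cells = row.SelectNodes("td");
            var values = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
                values[i] = cells[i].InnerText;
            rows.Add(values);
        }
        return rows;
    }

    public static void Main()
    {
        // Dropping a positional index such as [1] makes the expression match all data rows.
        foreach (string[] row in ReadRows(SampleTable, "//tbody/tr[@class='data1']"))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

Real pages are rarely well-formed XML, which is exactly why the Html Agility Pack is useful; the XPath part of the idea carries over.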

Now it's time to start scraping. Download the Html Agility Pack from Codeplex and add a reference to the DLL. The code is pretty simple: it gets the web page as a string, loads it into the Html Agility Pack, and lets it create the DOM structure. Then the XPath is used to get the list of rows in the table, and each row is translated into an innings object and added to a collection. At the end, the collection is written to a CSV file that can be opened in Excel. The code is pretty rough; I did it more as a trial. Building the complete database will be much more difficult, since it will involve parsing different kinds of pages and ensuring the integrity of the data.

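    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Text;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            new ReadText().StartParsing();
        }
    }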

    class ReadText
    {
        public void StartParsing()
        { 
            string TestUrl = "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;filter=advanced;page={0};orderby=start;size=200;template=results;type=batting;view=innings;wrappertype=print";
            Console.WriteLine("Extracting Tests\n\n");
            ExtractInningsView(TestUrl, "..\\..\\AllTestInnings.csv",404);
            Console.WriteLine("Extracting ODIs\n\n");
            string ODIUrl = "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=2;filter=advanced;page={0};orderby=start;size=200;template=results;type=batting;view=innings;wrappertype=print";
            ExtractInningsView(ODIUrl,"..\\..\\AllODIInnings.csv",356);

        }

        private void ExtractInningsView(string statUrl,string fileName,int pageCount)
        {
            List<InningsPlayed> AllInnings = new List<InningsPlayed>();
            for (int j = 1; j < pageCount; j++)
            {
                Console.WriteLine("Reading Page: " + j.ToString());
                string pageText = ReadWebPage(String.Format(statUrl, j));
                var htmlDoc = new HtmlDocument();
                htmlDoc.LoadHtml(pageText);

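                // size=200 in the URL, so a page can hold rows 1 through 200.
                for (int i = 1; i <= 200; i++)
                {
                    string inningsXpath = "//tbody/tr[@class='data1'][{0}]/td";
                    var nodeList = htmlDoc.DocumentNode.SelectNodes(String.Format(inningsXpath, i));

                    // SelectNodes returns null when nothing matches; also skip rows with too few cells.
                    if (nodeList != null && nodeList.Count > 11)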
                    {
                        AllInnings.Add(new InningsPlayed()
                        {
                            Name = nodeList[0].InnerText,
                            Runs = nodeList[1].InnerText,
                            Minutes = nodeList[2].InnerText,
                            BallsFaced = nodeList[3].InnerText,
                            Fours = nodeList[4].InnerText,
                            Sixes = nodeList[5].InnerText,
                            StrikeRate = nodeList[6].InnerText,
                            Innings = nodeList[7].InnerText,
                            Opposition = nodeList[9].InnerText,
                            Ground = nodeList[10].InnerText,
                            StartDate = nodeList[11].InnerText
                        });
                    }
                }
            }
            DumpToFile(AllInnings,fileName);
        }

        private void DumpToFile(List<InningsPlayed> AllInnings,string fileName)
        {
            // The using block flushes and closes the file when it ends,
            // so the last rows are not left behind in the writer's buffer.
            using (StreamWriter writer = new StreamWriter(fileName))
            {
                writer.WriteLine("Name,Runs,Minutes,Balls_Faced,Fours,Sixes,StrikeRate,Innings,Opposition,Ground,StartDate");
                foreach (var inning in AllInnings)
                {
                    writer.WriteLine(string.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10}", inning.Name, inning.Runs, inning.Minutes, inning.BallsFaced, inning.Fours, inning.Sixes, inning.StrikeRate, inning.Innings, inning.Opposition, inning.Ground, inning.StartDate));
                }
            }
        }

        private string ReadWebPage(string Url)
        {
            // Request the page and return the whole response body as a string.
            WebRequest request = WebRequest.Create(Url);
            // Dispose the response and reader so connections are released
            // between the hundreds of page requests.
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }

        
    }

    class InningsPlayed
    {
        public string Name { get; set; }
        public string Runs { get; set; }
        public string BallsFaced { get; set; }
        public string Minutes { get; set; }
        public string Fours { get; set; }
        public string Sixes { get; set; }
        public string StrikeRate { get; set; }
        public string Innings { get; set; }
        public string Opposition { get; set; }
        public string Ground { get; set; }
        public string StartDate { get; set; }
    }
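
One thing to watch in the CSV step: the values are joined with bare commas, so if any field ever contains a comma or a double quote, the columns of that row will shift when the file is opened in Excel. Below is a minimal sketch of quoting fields in the usual CSV style (wrap the field in double quotes and double any quotes inside it). The CsvSketch class and its Escape and ToLine helpers are hypothetical names and are not part of the listing above.

```csharp
using System;
using System.Linq;

static class CsvSketch
{
    // Quotes a field only when it contains a comma, a double quote or a line break.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins already-extracted values into one CSV line.
    public static string ToLine(params string[] fields)
    {
        return string.Join(",", fields.Select(f => Escape(f)).ToArray());
    }

    public static void Main()
    {
        // Made-up values for illustration.
        Console.WriteLine(ToLine("Player A", "120", "Ground, City"));
        // prints: Player A,120,"Ground, City"
    }
}
```

In DumpToFile, the string.Format call could then be swapped for something like CsvSketch.ToLine(inning.Name, inning.Runs, ...) if quoting turns out to be needed.
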
  • S Krupa Shankar

    Good, but you would be safer if you deleted the first line ;-) Screen scraping is illegal, isn't it?

  • http://www.facebook.com/people/Seeyar-Ahmadzai/766006739 Seeyar Ahmadzai

    Hello Ganesh, 

    I was wondering whether it is possible to create a Cricinfo-style database for a WordPress-based site?

    It doesn’t have to be a 100% copy, just match scorecards and simplified batting and bowling statistics. For batting, there should be these fields: No. of MATCHES, No. of INNINGS, NOT OUT, RUNS, AVERAGE, STRIKE RATE, 50s, 100s, Best Score, No. of 4s, No. of 6s.
    And for bowling, there should be these fields: OVERS, MAIDENS, RUNS, WICKETS, AVERAGE, STRIKE RATE, BEST BOWLING PERFORMANCE. Ideally it would be programmed to fetch this data automatically from match scorecards, which one can type in manually. I’m building a sports website covering Afghanistan, for which I need this. While I’ve found a promising plugin for football ( http://wordpress.org/extend/plugins/phpleague/ ), for the cricket part I’m totally blank :(
    I hope you can help me out :)

  • http://blog.ganeshzone.net Ganesh R

    hi Seeyar,

    I don’t think there are any automatic tools available in the market currently. There are two options:

    1) Build a custom scraper for the Cricinfo match scorecards.
    2) Purchase a license from a cricket stats provider that gives DB access (this is easier).

    I have an outdated stats DB (around a year old) with complete information extracted from Cricinfo, which I can share with you, but I don’t have the code to recreate it with the latest data.

  • Ben

    Hi Ganesh,

    Thanks for this post – really interesting. Have you tried to scrape ball-by-ball scores, on either Cricinfo or Cricket Archive? And above you mentioned a paid-for service for access to cricket stats – can you tell me where that is? Many thanks, Ben

  • fret

    Hi Ganesh,
    Wanted to discuss this with you – could you please send me your email address?

    - Fret