Refactoring a legacy project that has been running for more than 10 years


  In the second half of last year I took over maintenance of an outsourced project. The project started around 2005 and uses a traditional three-tier architecture. Its main function is a web crawler that scrapes product data from various overseas e-commerce sites and stores it in the client's database. My refactoring of the project recently passed acceptance, and I would like to share how I approached it.

Stage 1: Get familiar with the project framework and understand how it runs and is maintained

Tools used: Microsoft Visual Studio 2005, SQL Server 2005, Axosoft OnTime Scrum, SVN

Development process: the client provides a requirements document, then coding, unit testing, UAT deployment, UAT testing, deployment at the client, QA testing

Project layering: the traditional three tiers (presentation, business logic, data access)

At this stage I found several problems:

  1. Many of the requirements documents had been lost
  2. The code logic no longer matched the requirements documents
  3. There was a large amount of duplicated code
  4. The regular expressions used to match data were all stored in the database, which made them hard to read and inconvenient to modify
  5. Many of the regexes had grown far too complex; the one below is typical
<li\s*[^>]+list-view-*(?<css>[^"]*)"[^>]*>\s*  <h[^>]*>\s*<a[\s\S]*?href=[\d\D]{1,100}?(?<=MLA-)(?<id>\d+)[^<]*>\s* (?<name>[\d\D]{0,500}?)</a>\s*  (?:<a[^>]*>)?\s*<i\s*class="ch-icon-heart"></i>\s*</a>\s*</h\d+>\s*  (?:<p\s*[^>]+list-view-item-subtitle">(?<subTitle>[\d\D]{0,5000}?)</p>)?\s*  (?:<ul[^>]*>(?<subTitle2>[\d\D]{0,5000}?)</ul>)?\s*   (?:<a\s*href=[^>]+>)?\s*(?:<im[\d\D]{1,200}?(?:-|_)MLA(?<photo>[^\.]+)[^>]+>)?\s*(?:</a>)?\s*(?:<img[\d\D]{1,200}?images/(?<photo2>[^\.]+)[^>]+>)?\s*(?:</a>)?\s*  [\d\D]*?  <\s*[^>]+price-info">\s*  (?:<[^>]+price-info-cost">(?:[\d\D]*?)<strong\s*[^>]+price">\s*(?<currency>[^\d\&]*)(?:&nbsp;)?(?<price>\d+(?:.\d{3})* (?: .\d+)? ) \s*(?:<sup>(?<priceDecimales>\d*)</sup>\s*)?  (?: \s*<span[^>]*>[^<]*</span>)?  \s*</strong>\s*(?:</div>\s*)?  (?:<strong\s*[^>]+price-info-auction">(?<type>[^<]*)</strong>)?\s*  (?:<span\s*[^>]+price-info-auction-endTime">[^<\d]*?(?:(?<day>\d+)d)?\s*(?:(?<hour>\d+)h)? \s*(?:(?<minute>\d+)m)? \s* (?:(?<second>\d+)s)?\s*</span>\s*)?(?:</span>)?\s*  (?:<span\s*[^>]+price-info-installments"><span\s*class=installmentsQuantity>(?<numberOfPayment>\d+)</span>[^<]+  <span\s*[^>]+price">\s*[^<]*?(?<pricePayment>\d+(?:.\d{3})* (?: .\d+)? )\s*<sup>(?<pricePaymentDecimales>[\d\D]{0,10}?)</sup>\s*</span>\s*  </span>\s*)?|<[^>]*[^>]+price-info-cost-agreed">[^>]*</[^>]*>\s*)(?:</p>)?\s*  [\d\D]*?  (?:<ul\s*class="medal-list">\s*<li\s*[^>]+mercadolider[^>]*>(?<sellerBagde>[\d\D]{0,500}?)</li>\s*</ul>\s*)?  <ul\s*[^>]+extra-info">\s*(?:<li\s*class="ch-ico\s*search[^>]+">[^<]*</li>\s*)?  (?:<li\s*[^>]+mercadolider[^>]*>(?<sellerBagde>[\d\D]{0,500}?)</li>)?\s*(?:<!--\s*-->)?\s*  (?:<li\s*[^>]+[^>]*(?:condition|inmobiliaria|concesionaria)">\s*(?:<strong>)?(?<condition>[^\d<]*?)(?:</strong>)?\s*</li>\s*)?\s*  (?:<li\s*[^>]+"extra-info-sold">(?<bids>\d+)*[^<]*</li>\s*)?  (?:  <li\s*[^>]+[^>]*location">(?<location>[^<]*?)\s*</li>\s*(?:<li\s*class="free-shipping">[^<]*</li>\s*)?  |<li>(?<location>[^<]*?)\s*</li>\s*)?(?:<li\s*class="free-shipping">[^<]*</li>\s*)?  (?:</ul> |<li[^>]*>\s*Tel.?:\s*(?<phone>[^<]+)</li>)  |  <div\s*[^>]+item-[^>]*>\s*<h[^>]*>\s*<a\s*href=[\d\D]{1,100}?(?<=MLA-)(?<id>\d+)[^<]*>\s* (?<name>[\d\D]{0,500}?)</a>\s*</h3>\s*  (?:[\d\D]*?)<li\s*[^>]+costs"><span\s*[^>]+price">\s*(?<currency>[^\d\&]*)(?:&nbsp;)?(?<price>\d+(?:.\d{3})* (?: .\d+)? ) \s*</span></li>  (?:[\d\D]*?)(?:</ul> |<li[^>]*>\s*Tel.?:\s*(?:\&nbsp;)*(?<phone>[^<]+)</li>)
One of the complex regular expressions

Stage 2: Complete all the requirements documents and extract every regex into a file

A final step, updating the documentation, was added to the development process: whenever testing or maintenance finishes, the requirements document must be revised. At the same time, all regexes were extracted from the database into files, which reduces the effort of maintaining SQL and the chance that a newcomer makes mistakes while editing it. With the joint effort of QA and myself, more than 200 requirements documents were reorganized, giving us a solid basis for maintaining the project.
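To make the idea of keeping regexes in files concrete, here is a minimal sketch of a loader. The class name FilePatternStore and the simple Pattern/Name file layout are hypothetical and only illustrate the idea; the project's actual pattern container format is shown in the demo at the end of this post.

using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

public static class FilePatternStore
{
    // Reads <Pattern Name="..."><![CDATA[ ...regex... ]]></Pattern> elements from an
    // XML file and compiles them once, so jobs can look patterns up by name
    // instead of querying a SQL table.
    public static Dictionary<string, Regex> Load(string path)
    {
        Dictionary<string, Regex> patterns = new Dictionary<string, Regex>();
        XmlDocument doc = new XmlDocument();
        doc.Load(path);
        foreach (XmlNode node in doc.SelectNodes("//Pattern"))
        {
            string name = node.Attributes["Name"].Value;
            patterns[name] = new Regex(node.InnerText.Trim(),
                RegexOptions.IgnoreCase | RegexOptions.Compiled);
        }
        return patterns;
    }
}

The pattern files can then live in SVN next to the code, so a regex change is reviewed and versioned like any other change rather than being edited directly in the database.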

Stage 3: Rework the data access layer

To get rid of the traditional data-access-layer code I initially planned to adopt Entity Framework, but after talking to the client, who was more familiar with NHibernate, I wrapped NHibernate in a Repository instead. This repository wrapper has little to do with domain-driven design; it is really just an oversized DbHelper.

        public void SaveInfoByCity(InfoByCity line, string config)
        {
            SQLQuery query = new SQLQuery();

            query.CommandType = CommandType.StoredProcedure;
            query.CommandText = "HangZhou_InsertInfoByCity";

            SqlParameter[] parameters = new SqlParameter[7];
            parameters[0] = new SqlParameter("@City", line.City);
            parameters[1] = new SqlParameter("@AvailableUnits", line.AvailableUnits);
            parameters[2] = new SqlParameter("@AvailableSqm", line.AvailableSqm);
            parameters[3] = new SqlParameter("@ResAvailUnits", line.ResAvailUnits);
            parameters[4] = new SqlParameter("@ResAvailSqm", line.ResAvailSqm);
            parameters[5] = new SqlParameter("@ReservedUnits", line.ReservedUnits);
            parameters[6] = new SqlParameter("@ReservedSqm", line.ReservedSqm);

            SqlHelper.ExecuteNonQuery(ConnectionStringManager.GetConnectionString(CALLER_ASSEMBLY_NAME, config),
                query.CommandType, query.CommandText, parameters);

        }
The legacy Dao code to be removed

        /// <summary>
        /// Stamps the restaurant with the current run data and persists it through the repository
        /// </summary>
        /// <param name="restaurant"></param>
        public void SaveRestaurant(Restaurant restaurant)
        {
            restaurant.RunId = RunId;
            restaurant.RunDate = RunDate;
            restaurant.InsertUpdateDate = DateTime.Now;
            RepositoryHelper.CreateEntity(restaurant);
        }
Saving data through the repository wrapper
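For context, the sketch below shows roughly what such an "oversized DbHelper" boils down to. It is not the project's actual RepositoryHelper; it assumes NHibernate is configured through a standard hibernate.cfg.xml and only illustrates the session and transaction plumbing that a call like RepositoryHelper.CreateEntity hides.

using NHibernate;
using NHibernate.Cfg;

public static class RepositoryHelper
{
    // Built once from hibernate.cfg.xml; creating a session factory is expensive,
    // so it is shared by every save.
    private static readonly ISessionFactory SessionFactory =
        new Configuration().Configure().BuildSessionFactory();

    // Saves any mapped entity inside a short-lived session and transaction.
    public static void CreateEntity(object entity)
    {
        using (ISession session = SessionFactory.OpenSession())
        using (ITransaction transaction = session.BeginTransaction())
        {
            session.Save(entity);
            transaction.Commit();
        }
    }
}

With a wrapper like this, a Dao such as HungryhouseDao only sets the run metadata and calls CreateEntity, which is exactly what the SaveRestaurant method above does.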

Stage 4: Remove the large amount of duplicated code

Every crawling task, once abstracted, consists of just three parts: download, match, save. I therefore extracted a large number of shared helper methods, which makes each task simpler to write and easier to maintain. A minimal sketch of this structure follows.
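The sketch below shows the shape of that abstraction as a small template base class. The names CrawlTaskBase, Match and Save are illustrative rather than the project's real API, and the plain WebClient download stands in for the project's HttpManager with its proxy, retry and throttling logic.

using System.Net;

public abstract class CrawlTaskBase<T>
{
    protected abstract string Url { get; }

    // Parse one downloaded page into an entity.
    protected abstract T Match(string page);

    // Persist the entity, typically through the repository wrapper.
    protected abstract void Save(T entity);

    // The shared flow every task follows: download, match, save.
    public void Run()
    {
        string page = Download(Url);
        T entity = Match(page);
        Save(entity);
    }

    protected virtual string Download(string url)
    {
        using (WebClient client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}

Each concrete task then contains only what is specific to its site, which is what keeps the job code later in this post relatively short.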

Stage 5: Change the matching approach

The project originally relied entirely on regular expressions to extract data. Some sites have complex markup, which made the regexes huge. Worse, even a small change to a site often meant the whole regex matched nothing at all, so the regexes were hard to maintain.

My first idea was to organize the regexes into a tree structure and divide and conquer. After a trial period, maintaining and debugging a tree of regular expressions turned out to be painful, so I gave that up. Still, I felt the underlying idea of splitting the page into smaller and smaller fragments before matching was right, because small fragments are far easier to maintain. While thinking and searching I came across HtmlSelector: use HtmlSelector for DOM selection, then use regexes to match the details. The code gradually evolved into its current shape; a minimal sketch of the idea comes next, followed by a complete example.
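Before the full job, here is a small self-contained sketch of the "select a fragment first, then run a short regex" idea. The project uses its own HtmlSelector with CSS-style expressions; in this sketch HtmlAgilityPack and an XPath query stand in for it, and the sample HTML and regex mirror the location pattern from the demo below.

using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class SelectThenMatchDemo
{
    public static void Main()
    {
        string html =
            "<div class=\"CmsRestcatCityLandingLocations\"><ul>" +
            "<li><a href=\"/takeaway/london\">London</a></li></ul></div>";

        // Step 1: narrow the page down to one small DOM fragment.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode area = doc.DocumentNode.SelectSingleNode(
            "//div[@class='CmsRestcatCityLandingLocations']");
        if (area == null)
        {
            return; // the layout changed; only this selector needs fixing
        }

        // Step 2: run a short regex against the fragment only.
        Regex itemRegex = new Regex(
            @"<li[^>]*>\s*<a[^>]*href[^""]*""(?<Url>[^""]*)""[^>]*>\s*(?<Name>[^<]*)</a>");
        foreach (Match m in itemRegex.Matches(area.InnerHtml))
        {
            Console.WriteLine(m.Groups["Name"].Value + " -> " + m.Groups["Url"].Value);
        }
    }
}

Because each regex now only covers one small fragment, a site change usually breaks a single selector or a short expression instead of a thousand-character regex like the one shown in stage one.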

using System;
using System.Collections.Generic;
using System.Text;
using Majestic.Bot.Core;
using System.Diagnostics;
using Majestic.Util;
using Majestic.Entity.Shared;
using Majestic.Entity.ECommerce.Hungryhouse;
using Majestic.Dal.ECommerce;

namespace Majestic.Bot.Job.ECommerce
{
    public class Hungryhouse : JobRequestBase
    {
        private static string proxy;
        private static string userAgent;
        private static string domainUrl = "https://hungryhouse.co.uk/";
        private static string locationUrl = "https://hungryhouse.co.uk/takeaway";
        private int maxRetries;
        private int maxHourlyPageView;
        private HttpManager httpManager = null;
        private int pageCrawled = 0;

        /// <summary>
        /// This method is defined here primarily because we want to use the top-level
        /// class name as the logger name, so that the base class logs through the logger
        /// defined by the derived class rather than one of its own.
        /// </summary>
        /// <param name="row"></param>
        public override void Init(Majestic.Dal.Shared.DSMaj.Maj_vwJobDetailsRow row)
        {
            StackFrame frame = new StackFrame(0, false);

            base.Init(frame.GetMethod().DeclaringType.FullName, row);
        }

        /// <summary>
        /// Initializes the fields
        /// </summary>
        private void Initialize()
        {
            try
            {
                JobSettingCollection jobSettingCollection = base.GetJobSettings(JobId);
                proxy = jobSettingCollection.GetValue("proxy");
                userAgent = jobSettingCollection.GetValue("userAgent");
                maxRetries = jobSettingCollection.GetValue<int>("maxRetryTime", 3);
                maxHourlyPageView = jobSettingCollection.GetValue<int>("maxHourlyPageView", 4500);
                InithttpManager();
                InitPattern();
            }
            catch (Exception ex)
            {
                throw new MajException("Error initializing job " + m_sConfig, ex);
            }
        }

        /// <summary>
        /// Initialize the httpManager instance 
        /// </summary>
        private void InithttpManager()
        {
            if (String.IsNullOrEmpty(proxy) || proxy.Equals("none"))
            {
                throw new Exception("proxy was not set! job ended!");
            }

            httpManager = new HttpManager(proxy, this.maxHourlyPageView,
                delegate(string page)
                {
                    if (page.Contains("macys.com is temporarily closed for scheduled site improvements"))
                    {
                        return false;
                    }
                    else
                    {
                        return ComUtil.CommonValidateFun(page);
                    }
                },
                this.maxRetries);

            httpManager.SetHeader("Upgrade-Insecure-Requests", "1");
            httpManager.AcceptEncoding = "gzip, deflate, sdch";
            httpManager.AcceptLanguage = "en-US,en;q=0.8";
            httpManager.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";

            if (!string.IsNullOrEmpty(userAgent))
            {
                httpManager.UserAgent = userAgent;
            }
        }

        /// <summary>
        /// InitPattern
        /// </summary>
        private void InitPattern()
        {
            PatternContainerHelper.Load("Hungryhouse.pattern.xml");
        }

        /// <summary>
        /// The assembly entry point that controls the internal program flow. 
        /// It is called by the Run() function in the base class 
        /// <see cref="MajesticReader.Lib.JobBase"/>
        /// The program flow:
        /// 1. Get the job requests <see cref="MajesticReader.Lib.HitBoxJobRequest"/> based on JobId
        /// 2. For each request, get the input parameters
        /// 3. Retrieve the Html content
        /// 4. Identify and collect data based on the configuration settings for the request
        /// 5. Save collected data
        /// </summary>
        protected override void OnRun()
        {
            try
            {
                Initialize();

                int jobId = base.JobId;
                Log.RunId = base.RunId;

                HungryhouseDao.RunId = RunId;
                HungryhouseDao.RunDate = DateTime.Now;

                //get current job name
                string jobName = base.GetJobName();

                //Log start time
                Log.Info("Hungryhouse Started", string.Format(
                    "Job {0} - {1} Started at {2}", jobId, jobName, DateTime.Now));

                CollectLocation();

                //Log end time
                Log.Info("Hungryhouse Finished", string.Format(
                    "Job {0} - {1} Finished at {2}. {3} pages were crawled",
                    jobId, jobName, DateTime.Now, pageCrawled));

            }
            catch (Exception ex)
            {
                // This should never have happened, so it is "Unexpected"
                Log.Error("Unexpected/Unhandled Error", ex);
                throw new Exception("Unexpected/Unhandled Error", ex);
            }
        }

        /// <summary>
        /// CollectLocation
        /// </summary>
        private void CollectLocation()
        {
            Log.Info("Started Getting Locations", "Started Getting Locations");
            string page = DownloadPage(locationUrl);
            JobData locationData = ExtractData(page, "LocationArea", PatternContainerHelper.ToJobPatternCollection());
            JobDataCollection locationList = locationData.GetList();
            if (locationList.Count == 0)
            {
                Log.Warn("can not find locations", "can not find locations");
                return;
            }
            Log.Info("Locations", locationList.Count.ToString());
            foreach (JobData location in locationList)
            {
                string url = location.GetGroupData("Url").Value;
                string name = location.GetGroupData("Name").Value;
                if (string.IsNullOrEmpty(url) || string.IsNullOrEmpty(name))
                {
                    continue;
                }
                url = ComUtil.GetFullUrl(url, domainUrl);
                CollectRestaurant(name, url);
            }
            Log.Info("Finished Getting Locations", "Finished Getting Locations");
        }

        /// <summary>
        /// CollectRestaurant
        /// </summary>
        /// <param name="name"></param>
        /// <param name="url"></param>
        private void CollectRestaurant(string name, string url)
        {
            Log.Info("Started Getting Restaurant", string.Format("Location:{0},Url:{1}",name,url));
            string page = DownloadPage(url);
            JobData restaurantData = ExtractData(page, "RestaurantArea", PatternContainerHelper.ToJobPatternCollection());
            JobDataCollection restaurantList = restaurantData.GetList();
            if (restaurantList.Count == 0)
            {
                Log.Warn("can not find restaurant", string.Format("Location:{0},Url:{1}", name, url));
                return;
            }

            Log.Info("Restaurants", string.Format("Location:{0},Url:{1}:{2}", name, url, restaurantList.Count));
            foreach (JobData restaurant in restaurantList)
            {
                string tempUrl = restaurant.GetGroupData("Url").Value;
                string tempName = restaurant.GetGroupData("Name").Value;
                if (string.IsNullOrEmpty(tempUrl) || string.IsNullOrEmpty(tempName))
                {
                    continue;
                }
                tempUrl = ComUtil.GetFullUrl(tempUrl, domainUrl);
                CollectDetail(tempUrl, tempName);
            }
            Log.Info("Finished Getting Restaurant", string.Format("Location:{0},Url:{1}", name, url));
        }

        /// <summary>
        /// Collect detail
        /// </summary>
        /// <param name="url"></param>
        /// <param name="name"></param>
        private void CollectDetail(string url,string name)
        {
            string page = DownloadPage(url);
            Restaurant restaurant = new Restaurant();
            restaurant.Name = name;
            restaurant.Url = url;

            JobData restaurantDetailData = ExtractData(page, "RestaurantDetailArea", PatternContainerHelper.ToJobPatternCollection());
            restaurant.Address = restaurantDetailData.GetGroupData("Address").Value;
            restaurant.Postcode = restaurantDetailData.GetGroupData("Postcode").Value;
            string minimum = restaurantDetailData.GetGroupData("Minimum").Value;
            if (!string.IsNullOrEmpty(minimum) && minimum.ToLower().Contains("minimum"))
            {
                restaurant.Minimum = minimum;
            }

            try
            {
                HungryhouseDao.Instance.SaveRestaurant(restaurant);
            }
            catch (Exception ex)
            {
                Log.Error("Failed to save restaurant",url,ex);
            }
        }

        /// <summary>
        /// Downloads pages by taking sleeping time into consideration
        /// </summary>
        /// <param name="url">The url that the page is going to be downloaded from</param>
        /// <returns>The downloaded page from the specified url</returns>
        private string DownloadPage(string url)
        {
            string result = string.Empty;
            result = httpManager.DownloadPage(url);
            pageCrawled++;
            return result;
        }

    }
}
Demo
<?xml version="1.0"?>
<PatternContainer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Patterns>
    
    <!-- LocationArea -->
    <Pattern Name="LocationArea" Description="LocationArea" HtmlSelectorExpression=".CmsRestcatCityLandingLocations">
      <SubPatterns>
        <Pattern Name="Location" Description="Location" IsList="true" Field="Name,Url">
          <Expression>
            <![CDATA[
          <li[^>]*>\s*<a[^>]*href[^"]*"(?<Url>[^"]*)"[^>]*>\s*(?<Name>[^<]*)</a>
          ]]>
          </Expression>
        </Pattern>
      </SubPatterns>
    </Pattern>

    <!-- RestaurantArea -->
    <Pattern Name="RestaurantArea" Description="RestaurantArea" HtmlSelectorExpression=".CmsRestcatLanding.CmsRestcatLandingRestaurants.panel.mainRestaurantsList">
      <SubPatterns>
        <Pattern Name="Restaurant" Description="Restaurant" IsList="true" Field="Name,Url">
          <Expression>
            <![CDATA[
          <li[^>]*restaurantItemInfoName[^>]*>\s*<a[^>]*href[^"]*"(?<Url>[^"]*)"[^>]*>\s*<span>\s*(?<Name>[^<]*)</span>
          ]]>
          </Expression>
        </Pattern>
      </SubPatterns>
    </Pattern>

    <!-- RestaurantDetailArea -->
    <Pattern Name="RestaurantDetailArea" Description="Restaurant Detail Area">
      <SubPatterns>
        <Pattern Name="Address" Description="Address" Field="Address" HtmlSelectorExpression="span[itemprop=streetAddress]" />
        <Pattern Name="Postcode" Description="Postcode" Field="Postcode" HtmlSelectorExpression="span[itemprop=postalCode]" />
        <Pattern Name="Minimum" Description="Minimum" Field="Minimum">
          <Expression>
            <![CDATA[
              <div[^>]*orderTypeCond[^>]*>\s*<p>[\s\S]*?<span[^>]*>\s*(?<Minimum>[^<]*)</span>
            ]]>
          </Expression>
        </Pattern>
      </SubPatterns>
    </Pattern>    

  </Patterns>
</PatternContainer>
Demo pattern file: each Pattern's HtmlSelectorExpression narrows the page to one DOM fragment, and the regex inside its Expression extracts the named groups from that fragment

 

Author: Dynamic-xia

Blog: http://www.cnblogs.com/dynamic-xia

Note: this blog is mainly for learning, research and sharing. Reposting is welcome, but a clearly visible link to the original article must be included on the page.

