Web Scraping to Prepare Training Data for Machine Translation
During my internship at Birch.ai, we wanted to test and improve the performance of our machine translation model. To do that, we first needed a foreign-language data set paired with its English counterpart, ideally one translated by humans rather than by machines.
In this article, I share my experience of scraping The New England Journal of Medicine from scratch using Python. I obtained 15,584 paired records in Japanese and English. To preserve the original references, I also kept the URLs of both versions. You can find my GitHub repository for this blog here.
1. Scrape an article given a volume and a valid page number
The journal releases two volumes a year. To scrape every usable article (one that has both an English version and a Japanese version) from a volume, I wrote a function called search_page. Given a volume and a page number, it returns the paired Japanese and English paragraphs of one article, or an empty list if no English version is linked.
import requests
from bs4 import BeautifulSoup

def search_page(vol, page):
    print(f'Scraping from volume {vol} page {page}', end=":\t")
    url = f"https://nejm.jp/abstract/vol{vol}.p{page}"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    # Fetch the Japanese abstract and collect its paragraphs.
    jp_page = requests.get(url, headers=headers)
    soup = BeautifulSoup(jp_page.content, "html.parser")
    results = [parag.text for parag in soup.select(".add")]
    # The first link inside #sectionWrap is taken as the English original.
    try:
        new = soup.select("#sectionWrap a")[0].attrs['href']
    except (IndexError, KeyError):
        print("0 training sets added.")
        return []
    # Fetch the English article; fall back to the body if there is no abstract.
    en_page = requests.get(new, headers=headers)
    en_soup = BeautifulSoup(en_page.content, "html.parser")
    en_results = [parag.text for parag in en_soup.select("#article_Abstract .f-body")]
    if not en_results:
        en_results = [parag.text for parag in en_soup.select("#article_body .f-body")]
    # Pair paragraphs by position: (jp_url, jp_txt, en_url, en_txt).
    tmp = [(url, jp, new, en) for jp, en in zip(results, en_results)]
    print(f"{len(tmp)} training sets added.")
    return tmp
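One thing to keep in mind about the pairing step: zip stops at the shorter list, so if the Japanese and English pages have a different number of paragraphs, the extra ones are dropped without any message. The sketch below shows one way to surface that. The helper name pair_paragraphs and the sample strings are my own placeholders for illustration, not part of the original scraper.

```python
def pair_paragraphs(jp_url, jp_paragraphs, en_url, en_paragraphs):
    """Pair paragraphs by position and report whether the counts matched."""
    # zip() stops at the shorter list, so unmatched paragraphs are dropped.
    aligned = len(jp_paragraphs) == len(en_paragraphs)
    rows = [
        (jp_url, jp, en_url, en)
        for jp, en in zip(jp_paragraphs, en_paragraphs)
    ]
    return rows, aligned


# Placeholder inputs for illustration only, not scraped data.
rows, aligned = pair_paragraphs(
    "jp-url", ["jp paragraph 1", "jp paragraph 2"],
    "en-url", ["en paragraph 1", "en paragraph 2", "en paragraph 3"],
)
print(len(rows), aligned)  # prints: 2 False
```

Pages flagged as not aligned could then be logged and checked by hand before they go into a training set.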
2. Search all valid page numbers in a volume
Because the page numbers of the articles in a volume are irregular, I scraped the table of contents of each issue to collect all valid page numbers.
import re

def search_volumn(vol, num):
    url = f"https://nejm.jp/contents/idx.vol{vol}.no{num}"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    # Treat a page without any .pageNum element as a missing issue.
    if not soup.select(".pageNum"):
        raise Exception("No items found.")
    # Extract the page number from every abstract link in the contents list.
    results = [int(re.match(r'/abstract/vol\d+\.p(\d+)', parag.attrs['href'])[1])
               for parag in soup.select(".link02 a")]
    # Scrape each page and flatten the per-article lists into one list.
    return sum([search_page(vol, page) for page in results], [])
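To make the regular expression easier to follow, here is a small standalone sketch of the page-number extraction. It uses the same pattern with the dot escaped, and it skips links that do not match instead of failing on them. The helper name extract_page_numbers and the sample hrefs are assumptions made up for this illustration.

```python
import re

# Same pattern as in the scraper, with the dot escaped and compiled once.
ABSTRACT_HREF = re.compile(r"/abstract/vol\d+\.p(\d+)")


def extract_page_numbers(hrefs):
    """Return the page number from each matching href; skip the rest."""
    pages = []
    for href in hrefs:
        match = ABSTRACT_HREF.match(href)
        if match:
            pages.append(int(match.group(1)))
    return pages


# Illustrative hrefs only; the second stands in for an unrelated link.
sample = [
    "/abstract/vol337.p141",
    "/contents/idx.vol337.no3",
    "/abstract/vol337.p149",
]
print(extract_page_numbers(sample))  # prints: [141, 149]
```

In the one-line version, an href that does not match makes re.match return None, and indexing None raises a TypeError, so the skip-on-mismatch variant is a little more forgiving.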
3. Scrape all articles from the start volume to the end volume
I wrote a function called search_all to collect the contents of every usable article from the oldest volume to the latest one. It walks through the issues of each volume, saves one CSV file per volume, and can be used for any valid range of volumes.
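import os
import pandas as pd

def search_all(vol_start=336, vol_end=None):
    vol, num = vol_start, 1
    total_results = []
    vol_results = []
    os.makedirs('jp_en', exist_ok=True)  # to_csv does not create the folder
    # With vol_end=None, keep going until a volume has no first issue.
    while vol_end is None or vol <= vol_end:
        try:
            vol_results += search_volumn(vol, num)
            num += 1
        except Exception:
            # Any error is read as "issue `num` does not exist".
            if num == 1:
                break  # no first issue, so there are no more volumes
            # The volume is finished: save it and move on to the next one.
            if vol_results:
                pd.DataFrame(vol_results, columns=['jp_url', 'jp_txt', 'en_url', 'en_txt']).to_csv(f'jp_en/vol_{vol}.csv', index=False)
                total_results += vol_results
            vol_results = []
            vol, num = vol + 1, 1
    return total_results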
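Since the results are saved as one CSV file per volume, it can be handy to read them back without running the scraper again. Below is a minimal sketch that uses only the standard library. It assumes the jp_en/vol_{vol}.csv layout and the four column names from search_all; the function name load_saved_pairs is my own.

```python
import csv
import glob
import os


def load_saved_pairs(directory="jp_en"):
    """Read every vol_*.csv in `directory` and return the rows as dicts."""
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, "vol_*.csv"))):
        # utf-8 is the pandas to_csv default; newline="" is what csv expects.
        with open(path, encoding="utf-8", newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


if __name__ == "__main__":
    pairs = load_saved_pairs()
    print(f"{len(pairs)} paragraph pairs loaded")
```

Each row is a dict keyed by jp_url, jp_txt, en_url and en_txt, so a Japanese paragraph and its English counterpart stay together.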
Final Table
I split the final table into two parts so that it is easier to read.
jp_url | jp_txt |
---|---|
https://nejm.jp/abstract/vol337.p141 | 腫瘍壊死因子(TNF)は,リウマチ性関節炎の発病に関与する炎症促進性サイトカインであるため,TNF の拮抗により,疾患の活動性が低下する可能性がある.本試験は,ヒト IgG1 の Fc 部分と結合した可溶性 TNF 受容体(p75)(TNFR:Fc)から成る遺伝子組換え型融合蛋白である新規 TNF 拮抗剤の安全性および有効性を評価した. |
https://nejm.jp/abstract/vol337.p141 | この多施設二重盲検臨床試験では,難治性リウマチ性関節炎患者 180 人を無作為割付けし,プラセボまたは TNFR:Fc の 3 用量の一つ(0.25,2,16 mg/m2 体表面積)を週 2 回,3 ヵ月間皮下注射した.アメリカリウマチ学会基準にしたがって定義した関節炎の複合症状の変化によって,臨床効果を測定した. |
en_url | en_txt |
---|---|
https://www.nejm.org/doi/full/10.1056/NEJM199707173370301 | Tumor necrosis factor (TNF) is a proinflammatory cytokine involved in the pathogenesis of rheumatoid arthritis, and antagonism of TNF may reduce the activity of the disease. This study evaluated the safety and efficacy of a novel TNF antagonist — a recombinant fusion protein that consists of the soluble TNF receptor (p75) linked to the Fc portion of human IgG1 (TNFR:Fc). |
https://www.nejm.org/doi/full/10.1056/NEJM199707173370301 | In this multicenter, double-blind trial, we randomly assigned 180 patients with refractory rheumatoid arthritis to receive subcutaneous injections of placebo or one of three doses of TNFR:Fc (0.25, 2, or 16 mg per square meter of body-surface area) twice weekly for three months. The clinical response was measured by changes in composite symptoms of arthritis defined according to American College of Rheumatology criteria. |