Commit 1b87854f authored by gustawdaniel

Processing tables with lawyers data

parent 317b4bfa
# Scraper for rejestradwokatow.pl
How to use
1. Get pages with table (pagination with 272 pages, 100 items per page)
This command downloads every page of the table into the `raw` directory:
```
mkdir -p raw && for i in {1..272}; do wget "https://rejestradwokatow.pl/adwokat/wyszukaj/strona/$i" -O raw/$i.html; done
```
2. Process table data
This reads every table file in the `raw` directory and generates `out/basic_data.json`:
```
ts-node entry.ts
```
Each record in the basic data has this structure:
```
Output {
surname: string
name: string
second_name: string
city: string
office: string
status: string
link: string
}
```
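Once `out/basic_data.json` exists, the records can be summarised with a few lines of TypeScript. The sketch below is not part of the repository: `BasicRecord`, `countByStatus` and the sample rows are hypothetical names and made-up data that only mirror the structure shown above.

```typescript
// Hypothetical shape mirroring the `Output` structure from the README.
interface BasicRecord {
  surname: string;
  name: string;
  second_name: string;
  city: string;
  office: string;
  status: string;
  link: string;
}

// Hypothetical helper: counts how many records carry each status value.
const countByStatus = (records: BasicRecord[]): Record<string, number> =>
  records.reduce<Record<string, number>>((acc, { status }) => {
    acc[status] = (acc[status] ?? 0) + 1;
    return acc;
  }, {});

// Made-up rows standing in for the parsed contents of out/basic_data.json.
const sample: BasicRecord[] = [
  { surname: 'A', name: 'X', second_name: '', city: 'C1', office: 'O1', status: 'Wykonujący zawód', link: '' },
  { surname: 'B', name: 'Y', second_name: '', city: 'C2', office: 'O2', status: 'Były adwokat', link: '' },
  { surname: 'C', name: 'Z', second_name: '', city: 'C1', office: 'O1', status: 'Wykonujący zawód', link: '' },
];

console.log(countByStatus(sample));
```

With real data, `sample` would be replaced by something like `JSON.parse(fs.readFileSync('out/basic_data.json').toString())`.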
\ No newline at end of file
import fs from 'fs';
import cheerio from 'cheerio';
import {LawyerStatus, Output} from './helpers'
const getFiles = (): string[] => fs
.readdirSync(process.cwd() + `/raw`)
.filter((name) => /^\d+\.html$/.test(name)) // anchor the end so names like "1.html.bak" are skipped
.map(name =>
fs.readFileSync(process.cwd() + '/raw/' + name).toString()
);
const processFile = (content: string): Output[] => cheerio
.load(content)('.rejestr tbody tr')
.toArray()
.map(row => ({
surname: cheerio(row).find('td:nth-of-type(2)').text(),
name: cheerio(row).find('td:nth-of-type(3)').text().trim(),
second_name: cheerio(row).find('td:nth-of-type(4)').text(),
city: cheerio(row).find('td:nth-of-type(5)').text(),
office: cheerio(row).find('td:nth-of-type(6)').text(),
status: cheerio(row).find('td:nth-of-type(7)').text() as LawyerStatus,
link: cheerio(row).find('td:nth-of-type(8) a').attr('href') || '',
}))
const reducer = (a:Output[], b:Output[]):Output[] => [...a, ...b];
const main = () => {
// Pass an initial value so reduce does not throw when `raw` contains no pages
return getFiles().map(processFile).reduce(reducer, []);
}
const out = main();
// With `recursive: true`, mkdirSync does not fail when the directory already exists
fs.mkdirSync(process.cwd() + '/out', {recursive: true})
fs.writeFileSync(process.cwd() + '/out/basic_data.json', JSON.stringify(out))
console.dir(out)
\ No newline at end of file
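`getFiles` processes the pages in whatever order `fs.readdirSync` returns their names, and that order is not sorted by page number (a name-based order would put `10.html` before `2.html`). If the record order in `out/basic_data.json` should follow the pagination, the names could be sorted numerically first. This is a minimal sketch under that assumption; `sortPageFiles` is a hypothetical helper, not something the commit contains.

```typescript
// Hypothetical helper: keep only "<number>.html" names and order them by page number.
const sortPageFiles = (names: string[]): string[] =>
  names
    .filter((name) => /^\d+\.html$/.test(name))
    // parseInt stops at the first non-digit, so '10.html' parses as 10
    .sort((a, b) => parseInt(a, 10) - parseInt(b, 10));

console.log(sortPageFiles(['10.html', '2.html', 'notes.txt', '1.html']));
// prints [ '1.html', '2.html', '10.html' ]
```

In `getFiles`, such a helper would sit between `readdirSync` and the `map` that reads each file.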
export enum LawyerStatus {
active = "Wykonujący zawód",
former = "Były adwokat",
inactive = "Niewykonujący zawodu",
undefined = ""
}
export interface Output {
surname: string
name: string
second_name: string
city: string
office: string
status: LawyerStatus
link: string
}
\ No newline at end of file
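In `entry.ts` the status cell is cast with `as LawyerStatus`, which the compiler accepts for any string, including values the enum does not list. A runtime check is one way to notice such values. The sketch below is an illustration only: `KNOWN_STATUSES`, `KnownStatus` and `isKnownStatus` are hypothetical names, the status strings are copied from the enum above, and `'Zawieszony'` is a made-up value used to show a rejected input.

```typescript
// Status strings copied from the LawyerStatus enum; kept local so the sketch is self-contained.
const KNOWN_STATUSES = [
  'Wykonujący zawód',
  'Były adwokat',
  'Niewykonujący zawodu',
  '',
] as const;

type KnownStatus = typeof KNOWN_STATUSES[number];

// Hypothetical type guard: true only for strings that match one of the known statuses exactly.
const isKnownStatus = (value: string): value is KnownStatus =>
  (KNOWN_STATUSES as readonly string[]).indexOf(value) !== -1;

console.log(isKnownStatus('Były adwokat')); // true
console.log(isKnownStatus('Zawieszony'));   // false (made-up value)
```

A scraper could log or collect the rows where the guard returns `false` instead of silently treating them as valid statuses.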