gustawdaniel / lawyers-scraper
Commit 1b87854f, authored Feb 17, 2021 by gustawdaniel
Processing tables with lawyers data
Parent: 317b4bfa
Showing 3 changed files with 87 additions and 0 deletions
README.md (+33, -0)
entry.ts (+38, -0)
helpers.ts (+16, -0)
README.md
0 → 100644
# Scraper for rejestradwokatow.pl
How to use

1. Get the pages with the table (pagination: 272 pages, 100 items per page)

This command downloads every page of the table to your computer:
```
mkdir -p raw && for i in {1..272}; do wget "https://rejestradwokatow.pl/adwokat/wyszukaj/strona/$i" -O raw/$i.html; done
```
2. Process the table data

This reads every file with a table from the `raw` directory and generates `out/basic_data.json`:
```
ts-node entry.ts
```
The basic data has the following structure:
```
Output {
    surname: string
    name: string
    second_name: string
    city: string
    office: string
    status: string
    link: string
}
```
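
Records of this shape can be summarised directly once the JSON file has been parsed into an array. The following is a minimal sketch, not part of this repository: `countByStatus` is a hypothetical helper, and the sample records are made-up placeholders rather than real registry data.

```typescript
interface Output {
    surname: string
    name: string
    second_name: string
    city: string
    office: string
    status: string
    link: string
}

// Hypothetical helper: count how many records carry each status value.
const countByStatus = (records: Output[]): Record<string, number> =>
    records.reduce<Record<string, number>>((acc, record) => {
        acc[record.status] = (acc[record.status] || 0) + 1;
        return acc;
    }, {});

// Placeholder records for illustration only (not real data).
const sample: Output[] = [
    {surname: 'A', name: 'X', second_name: '', city: 'C1', office: 'O1', status: 'Wykonujący zawód', link: ''},
    {surname: 'B', name: 'Y', second_name: '', city: 'C2', office: 'O2', status: 'Wykonujący zawód', link: ''},
    {surname: 'C', name: 'Z', second_name: '', city: 'C3', office: 'O3', status: 'Były adwokat', link: ''},
];

console.log(countByStatus(sample));
```

With real data, `sample` would presumably be replaced by the parsed contents of `out/basic_data.json`.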
\ No newline at end of file
entry.ts
import fs from 'fs';
import cheerio from 'cheerio';
import {LawyerStatus, Output} from './helpers'
const getFiles = (): string[] => fs.readdirSync(process.cwd() + `/raw`).filter((name) => /^\d+\.html/.test(name)).map(
    name => fs.readFileSync(process.cwd() + '/raw/' + name).toString()
);
const processFile = (content: string): Output[] => cheerio.load(content)('.rejestr tbody tr').toArray().map(row => ({
    surname: cheerio(row).find('td:nth-of-type(2)').text(),
    name: cheerio(row).find('td:nth-of-type(3)').text().trim(),
    second_name: cheerio(row).find('td:nth-of-type(4)').text(),
    city: cheerio(row).find('td:nth-of-type(5)').text(),
    office: cheerio(row).find('td:nth-of-type(6)').text(),
    status: cheerio(row).find('td:nth-of-type(7)').text() as LawyerStatus,
    link: cheerio(row).find('td:nth-of-type(8) a').attr('href') || '',
}))
const reducer = (a: Output[], b: Output[]): Output[] => [...a, ...b];

const main = () => {
    return getFiles().map(processFile).reduce(reducer);
}

const out = main();
!fs.existsSync(process.cwd() + '/out') && fs.mkdirSync(process.cwd() + '/out', {recursive: true})
fs.writeFileSync(process.cwd() + '/out/basic_data.json', JSON.stringify(out))

console.dir(out)
\ No newline at end of file
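
A note on `main`: calling `reduce(reducer)` without an initial value throws a `TypeError` on an empty array, which would happen if `raw` contained no matching files. Below is a minimal sketch of a variant that tolerates that case. `flatten` is a hypothetical helper, not part of this commit.

```typescript
// Hypothetical helper: concatenate a list of lists into a single list.
// The explicit initial value [] makes an empty input return [] instead of throwing.
const flatten = <T>(lists: T[][]): T[] =>
    lists.reduce<T[]>((acc, list) => [...acc, ...list], []);

console.log(flatten([[1], [2, 3], []]));
console.log(flatten([]));
```

Under this assumption, `main` could return `flatten(getFiles().map(processFile))`.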
helpers.ts
0 → 100644
export enum LawyerStatus {
    active = "Wykonujący zawód",
    former = "Były adwokat",
    inactive = "Niewykonujący zawodu",
    undefined = ""
}
export interface Output {
    surname: string
    name: string
    second_name: string
    city: string
    office: string
    status: LawyerStatus
    link: string
}
\ No newline at end of file
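
The `as LawyerStatus` cast in `entry.ts` is not checked at runtime, so an unexpected cell value would pass through silently. Below is a minimal sketch of a runtime check, assuming the four status strings listed in the enum are the only valid ones. `KNOWN_STATUSES`, `isKnownStatus` and `parseStatus` are hypothetical names, not part of this commit. The trimming step is also an assumption, since the original code does not trim the status cell.

```typescript
// Status strings copied from the LawyerStatus enum values.
const KNOWN_STATUSES = ['Wykonujący zawód', 'Były adwokat', 'Niewykonujący zawodu', ''] as const;
type KnownStatus = typeof KNOWN_STATUSES[number];

// Hypothetical type guard: true only for strings present in KNOWN_STATUSES.
const isKnownStatus = (value: string): value is KnownStatus =>
    (KNOWN_STATUSES as readonly string[]).indexOf(value) !== -1;

// Hypothetical parser: trims the cell text and fails loudly on unknown values.
const parseStatus = (raw: string): KnownStatus => {
    const value = raw.trim();
    if (!isKnownStatus(value)) {
        throw new Error(`Unknown lawyer status: "${value}"`);
    }
    return value;
};

console.log(parseStatus(' Były adwokat '));
```

Failing on an unknown value is one possible design choice. Mapping it to the empty status would be another, depending on how strict the scraper should be.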