
VMX

the blllog.




ABOUT ME

My name is Volker Mische and I'm an open source enthusiast and hacker. You can
reach me via email, on Twitter (@vmx) or IRC (as vmx). Find me also on GitHub.


CATEGORIES

 * CouchDB (20)
 * Couchbase (3)
 * EU (2)
 * Erlang (6)
 * Festival (6)
 * Freifunk (1)
 * GeoCouch (10)
 * IPFS (2)
 * IPLD (1)
 * JavaScript (11)
 * Kino (4)
 * Leben (4)
 * MapQuery (5)
 * Musik (2)
 * Node (2)
 * Noise (6)
 * OpenLayers (3)
 * ProtocolLabs (1)
 * Python (9)
 * RocksDB (2)
 * Rust (6)
 * TileCache (1)
 * Verschiedenes (4)
 * WASM (1)
 * Wjt (8)
 * climatechange (1)
 * conference (10)
 * copyright (2)
 * de (27)
 * en (49)
 * festival (1)
 * film (1)
 * funding (2)
 * geo (29)
 * geoyoga (1)
 * life (1)
 * misc (1)
 * npm (1)
 * party (1)
 * politics (2)
 * psychogeography (1)
 * tutorial (1)


ARCHIVES

 * 2023-07 (1)
 * 2021-06 (1)
 * 2021-01 (1)
 * 2019-08 (1)
 * 2019-06 (1)
 * 2019-04 (1)
 * 2019-03 (1)
 * 2018-01 (1)
 * 2017-12 (3)
 * 2017-10 (2)
 * 2017-09 (3)
 * 2017-02 (1)
 * 2016-11 (1)
 * 2016-07 (1)
 * 2015-02 (1)
 * 2014-09 (1)
 * 2013-10 (1)
 * 2012-10 (1)
 * 2012-06 (1)
 * 2012-05 (1)
 * 2012-04 (1)
 * 2012-01 (1)
 * 2011-09 (1)
 * 2011-05 (1)
 * 2011-04 (1)
 * 2010-07 (2)
 * 2010-06 (1)
 * 2010-05 (3)
 * 2010-02 (1)
 * 2009-12 (1)
 * 2009-11 (3)
 * 2009-10 (1)
 * 2009-09 (2)
 * 2009-08 (1)
 * 2009-07 (3)
 * 2009-05 (1)
 * 2009-04 (2)
 * 2009-03 (1)
 * 2009-02 (2)
 * 2009-01 (2)
 * 2008-11 (2)
 * 2008-10 (2)
 * 2008-09 (1)
 * 2008-08 (1)
 * 2008-07 (9)
 * 2008-06 (3)
 * 2008-05 (1)


FOSS4G 2023

2023-07-22 21:50

Finally, after missing one virtual and one in-person global FOSS4G, I again had
the chance to attend a global in-person FOSS4G conference. Thanks to Protocol
Labs for sending me. This year it was in Prizren, Kosovo. I'm a bit late with
this post, but that's because I went hiking in Albania right after the conference.


THE ORGANIZATION AND VENUE

Wow. It's been my favourite venue of all the FOSS4Gs I've been to so far. The
exhibition hall was a great place to hang out, combined with the excellent idea
of a 24h bar. I'm not sure if it was used around the clock, but definitely for
more than 20 hours a day. Outside, there was plenty of space and tables to hang
out at, and very close by another set of tables that formed the "work area",
which was another great spot, with enough power sockets and shade for the hot
days.

The main stage was an open-air stage with enough seating for everyone. For the
gala dinner it was converted into a stage with an excellent live band and the
usual big round tables.

For me, the best part was that even the accommodation was on-site. The barracks
of the former military base, which now serve as student dorms, were our home
for a week. Pretty spartan, but at a conference I don't really spend much time
in my room anyway; I mostly just need a place to sleep.

Having everything on-site (the talks, exhibition, social events and
accommodation) makes it easy to maximize the time for socializing, which for me
is the number one reason to attend a conference.

Everything was well organized, and it was great to see so many volunteers
around.


THE TALKS

I didn't really pre-select the talks I went to. Instead I joined others wherever
they were going, or followed recommendations. Often I just stayed for the rest
of the slot to see what else was there. My favourite talks were:

 * Smart Maps for the UN and All - keeping web maps open: For me, it was the
   first time I saw someone other than me speaking at a FOSS4G about using IPFS.
   It's great to see that it is gaining traction for the offline use case, where
   it just makes a lot of sense. UN Smart Maps is part of the UN OpenGIS
   initiative and features a wide range of things, even an AI chatbot called
   TRIDENT that transforms text into Overpass API calls. Try TRIDENT out
   yourself; when you open the developer console, you can see the resulting
   Overpass API calls.
 * Offline web map server "UNVT Portable": This talk went into more detail about
   using Raspberry Pis to store map data in IPFS for offline use. It's very
   similar to what I envision; the only difference is that I'd also like to keep
   the storage in the browser. But I surely see a future where those efforts are
   combined, to have a small, easy-to-deploy server, with in-browser copies of
   subsets of the data so you can work completely offline in the field. The
   original UNVT Portable repository doesn't use IPFS, but Smart Maps Bazaar,
   which seems to be its successor, does.
 * B6, Diagonal's open source geospatial analysis engine: A presentation of the
   B6 tool for geospatial analysis for urban planning. It has a beautiful
   interface. I really like the idea of doing things directly on the map in a
   notebook-style way, where you perform steps one after another.
 * Elephant in the room: A talk about how many resources computations take, and
   whether we always need them. It's very hard, often impossible, to find out
   how environmentally friendly some cloud services are. One of the conclusions
   was that cheaper providers likely use less power, hence harm the environment
   less. I would like there to be better metrics (price misses things like the
   economies of scale of large providers), but I agree that this might be the
   best metric we currently have. And I also hope there will be more economic
   pressure to save resources.
 * There was a closing keynote from Kyoung-Soo Eom, who talked about his long
   journey in open source GIS, but also about his history with Kosovo, where he
   was on a mission in 1999. Quite inspiring.


MY TALK

My talk about Collaborative mapping without internet connectivity was about a
browser-based offline-first prototype that uses IPFS to enable replication to
other peers. The project is called Colleemap and is dual-licensed under the MIT
and Apache 2.0 licenses. Although I had tried the demo a bazillion times
beforehand, it sadly didn't work during my talk. Trying it later with various
people, though, I was able to get 4 peers connected once. I even saw it working
on a Windows machine, so it really works cross-platform.

For the future I hope to work more closely with the people from the UN OpenGIS
initiative; it would be great to combine it with their Raspberry Pi based
prototype.


THINGS I’VE LEARNT

The Sentinel-2 satellite imagery is available from multiple sources, directly
from the Copernicus Open Access Hub or through cloud providers like AWS, Azure
or Google Cloud. From the cloud providers you only get the level-2 data. They
might use the original level-2 data, do their own atmospheric correction based
on the level-1 data, or even re-encode the data. So it's hard to tell which kind
of data you actually get.

As far as I know (please let me know if I'm wrong), there isn't any mirror of
the full level-1c data. You can only get it through the Copernicus Open Access
Hub, and there the older images are stored in the long-term archive on tape,
where it can take up to 24h for the data to become available for download (if
it works).

Ideally, there would be a mirror of the full level-1c data (where ESA would
provide checksums of their files) and a level-2 version where the exact process
is openly published, so that you can verify how it was created. The problem is
the storage cost. The current level-2 data is about 25 PiB, which leads to
storage costs of over $500k a month if you stored it on AWS S3 Standard at
current pricing (I used $0.021 per GB).
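
As a rough sanity check of that number (a sketch assuming a GiB-based conversion
and the $0.021 per GB price from above):

# 25 PiB = 25 * 1024 * 1024 GiB; multiplied by the S3 Standard price used above
echo "25 * 1024 * 1024 * 0.021" | bc
# prints 550502.400, i.e. a bit over $500k per month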


FINAL THOUGHTS

It was great to have already met Gresa and Valmir from the local organizing
committee in March, at FOSSGIS, the OSGeo German language chapter conference,
in Berlin. That made it easy for me to connect to the event right from the
start. If there's one thing future FOSS4Gs should adopt, it's the cheap on-site
(or close-by) accommodation. I think shared bathrooms are also much easier to
live with if you know that everyone in the accommodation is from the conference.
We had something similar with the BaseCamp in Bonn during the FOSS4G 2016 and
the international code sprint in 2018 during the FOSSGIS conference, where the
whole place was rented for the duration of the events.

Of course, I also missed some of my longtime FOSS4G friends whom I hadn't seen
in a long time. I hope you're all doing well and that we'll meet again soon.

2 Comments

Categories: en, IPFS, conference, geo


VIDEO UPLOADS FOR AN ONLINE CONFERENCE

2021-06-12 16:35

This blog post should give some insight into what happens behind the scenes in
the preparation of an online conference, and I also hope that some of the
scripts I created might be useful for others as well. We were using pretalx for
the submissions and Seafile for the video uploads. Both systems are accessed
over their HTTP APIs.

This year's FOSSGIS 2021 conference was a pure online conference, though it had
the same format as every year: three days of conference, with four tracks in
parallel, which leads to about 100 talks. I joined the organizing team about 10
weeks before the conference took place. The task sounded easy: the speakers
should be able to upload their talks prior to the conference, so that less
could go wrong during the conference itself.

All scripts are available at https://github.com/vmx/conference-tools licensed
under the MIT License.


THE SOFTWARE

The speakers submitted their talks through pretalx, a conference management
system I highly recommend. It is open source and has an active community. I've
worked on/with it over the past few years to make it suitable for OSGeo
conferences. The latest addition is the public community voting plugin, which
has been used for the FOSS4G 2021 as well as this conference. pretalx has a
great HTTP API to get data out of the system. It doesn't yet have much support
for manipulating the data, but pull requests are welcome.

For storing the video files, Seafile was used. I hadn't had any prior experience
with it. It took me a bit to figure out that the Python API is for local access
only and that the public API is a pure HTTP API. You can clearly see that their
API is tailored to the needs of their web interface and not really designed for
third-party usage. Nonetheless, everything that can be done through the web UI
can also be done via the HTTP API.

My scripts are heavily based on command line tools like b2sum, curl, cut, jq and
jo, hence a lot of shell is used. For more complex data manipulation, like
merging data, I use Python.
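
Just to give a flavour of how those tools fit together, here is a tiny,
hypothetical pipeline (the URL and field name are from the pretalx examples
below; the jo step is purely illustrative and not one of the real scripts):

# Fetch submissions, pull out the pretalx codes with jq, and wrap each one
# into a small JSON object with jo
curl --silent 'https://pretalx.com/api/events/democon/submissions/?format=json' \
  | jq --raw-output '.results[].code' \
  | while read -r code; do jo talk="$code"; done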


THE TASK

The basic task is providing pre-recorded videos for a conference, uploaded by
the speakers themselves. The actual finer-grained steps are:

 * Sending the speakers upload links
 * Looking through the videos to make sure they are good
 * Re-organizing the files so they can be played back according to the schedule
 * Making the final files easily downloadable
 * Creating a schedule which lists the live/pre-recorded talks

SENDING UPLOAD LINKS

In Seafile you can create directories and make them publicly available so that
people can upload files. Once a file is uploaded, the uploader won't see what
else is in that directory. In order to be able to easily reference the uploaded
videos back to the corresponding talk, it was important to create one dedicated
directory per talk, as you won't know which filenames people will use for their
videos.

The speakers will receive an email containing dedicated upload links for each of
their talks. See the email_upload_links directory for all the scripts that are
needed for this step.

PRETALX

First you need to get all the talks. In pretalx that's easy: go to your
conference, e.g. https://pretalx.com/api/events/democon/submissions/. We only
care about the accepted talks, which can be selected with a filter. If you
access it through curl, you'll get a JSON response like this one:
https://pretalx.com/api/events/democon/submissions/?format=json. pretalx returns
25 results per request. I've created a script called pretalx-get-all.py that
automatically pages through all the results and concatenates them.
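
If you'd rather stay in the shell, a minimal equivalent of that paging logic
could look like this (assuming the usual paginated response with a "next" URL,
which is what pretalx-get-all.py relies on):

# Follow the "next" links until there are no more pages and print all results
url='https://pretalx.com/api/events/democon/submissions/?format=json'
while [ "$url" != "null" ]; do
  page=$(curl --silent "$url")
  echo "$page" | jq '.results[]'
  url=$(echo "$page" | jq --raw-output '.next')
done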

A talk might be associated with multiple speakers, and each speaker should get
an email with an upload link. There were also submissions that aren't really
talks in the traditional sense, for which people shouldn't get an email. The jq
query looks like this:

[.results[]
  | select((.submission_type[] | contains("Workshop")) or (.submission_type[] == "Anwendertreffen / BoF") | not)
  | { code: .code, speaker: .speakers[].code, title: .title, submission_type: .submission_type[]}]

The submissions contain only the speaker IDs and names, but no other details
like their email addresses. So we query the speakers API (e.g.
https://pretalx.com/api/events/democon/speakers/) and post-process the data
again with jq to get the email addresses.
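
A hedged sketch of that speakers request (both the assumption that the email
field is only included for authenticated requests and the exact jq filter are
mine, not taken from the script):

# Get speaker codes, names and email addresses from the speakers API
curl --silent --header 'Authorization: Token <pretalx-token>' \
  'https://pretalx.com/api/events/democon/speakers/?format=json' \
  | jq '[.results[] | {code: .code, name: .name, email: .email}]'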

You can find all the requests and filters in the
email_upload_links/upload_talks_to_seafile.sh script.

SEAFILE

Creating an upload link is a two-step process in Seafile: first create the
directory, then create a publicly accessible upload link for that directory.
The directories are named after the pretalx ID of the talk (full script for
creating directories).
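
As a sketch, the two Seafile calls could look roughly like this (the endpoint
paths are assumptions based on the Seafile Web API, not copied from the script):

# Step 1: create a directory named after the pretalx ID of the talk
curl -X POST --header 'Authorization: Token <seafile-token>' \
  'https://seafile.example.org/api2/repos/<repo-id>/dir/?p=/<pretalx-id>' \
  --data 'operation=mkdir'

# Step 2: create a publicly accessible upload link for that directory
# (assumed v2.1 endpoint)
curl -X POST --header 'Authorization: Token <seafile-token>' \
  'https://seafile.example.org/api/v2.1/upload-links/' \
  --data 'repo_id=<repo-id>' --data 'path=/<pretalx-id>/'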

CREATING EMAILS

After acquiring the data, the next step is to process it and create the
individual emails. Combining the data is done with the
combine_talks_speakers_upload_links.py script, whose output is again
post-processed with jq. The data_to_email.py script takes that output and a
template file to create the actual emails as files. The template file is used
as a Python format string, where the variables are filled with the data
provided.
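
To illustrate what such a template might look like (the placeholder names here
are made up; the real ones are whatever data_to_email.py passes in):

# A hypothetical email template; the {placeholder} fields are filled in via
# Python's str.format()
cat > email_template.txt <<'EOF'
Hello {speaker_name},

please upload the pre-recorded video for your talk "{title}" using this link:
{upload_link}
EOF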

Those email files are then posted to pretalx, so that we can send them over its
email system. That step is more complicated, as there is currently no API in
pretalx to do that. I logged in through the web interface and manually added a
new email while having the developer tools open. I then copied the POST request
"as cURL" to have a look at the data it sent. From that I manually extracted the
session and cookie information in order to add emails from the command line.
The script that takes the pre-generated emails and puts them into pretalx is
called email_to_pretalx.sh.


REVIEWING THE UPLOADED VIDEOS

Once a video is uploaded, it gets reviewed. The idea was that the speakers don't
need to care too much about the start and end of the video, e.g. when they start
the recording and there are a few seconds of silence while switching to the
presentation. The reviewer cuts the beginning and end of the video and also
converts it to a common format.

We wanted to preserve the original video quality, hence we used LosslessCut and
then converted the videos to the Matroska format. The reviewers would also check
that a video isn't longer than the planned slot.

See the copy_uploads directory for all the scripts that are needed for this
step.

PRETALX

The reviewers get a file with things to check for each video file. We get the
needed metadata again from pretalx and post-process it with jq. As above for the
emails, there is again a template file which (this time) generates Markdown
files with the information for the reviewers. The full script is called
create_info_files.sh.

SEAFILE

Once videos are uploaded they should be available for the reviewers. The
uploaded files are the primary source, hence it makes sense to always make
copies of the talks, so that the original uploads are not lost. The
sync_files_and_upload_info.sh script copies the talks into a new directory
(together with the information files), which is then writeable for the
reviewers. They will download the file, review it, cut it if needed, convert it
to Matroska and upload it again. Once uploaded, they move the directory into one
called fertig (“done” in German) as an indicator that no one else needs to
review it.

I run the script daily as a cron job; it only copies the new uploads. Please
note that it checks for existence only on the directory level. This means that
if a talk was already reviewed and a speaker uploads a new version, it won't be
copied. That case didn't happen often, and speakers actually let us know about
it, so it's mostly a non-issue (also see the miscellaneous scripts section for
more).
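
The cron job itself is just a one-liner; a hypothetical crontab entry (the time
of day is made up) could look like this:

# Copy new uploads once a day and keep a log
0 4 * * * /path/to/sync_files_and_upload_info.sh >> /var/log/video-sync.log 2>&1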

The last step is that someone looks through the filled-out Markdown files to
check that everything is alright, makes sure that e.g. the audio volume gets
fixed, or asks the speaker for a new upload. The checked videos are then moved
to yet another directory, which ends up containing all the talks that are ready
to be streamed.


RE-ORG FILES FOR SCHEDULE

So far, the video files were organized in directories named after the pretalx
ID of the talk. For running the conference we used OBS for streaming. The
operator needs to play the right video at the right time, therefore it makes
sense to sort the files by the schedule. The cut_to_schedule.sh script, which
can be found in the cut_to_schedule directory, does that re-organization.

PRETALX

To prevent accidental inconsistencies, the root directory is named after the
current version of the pretalx schedule. So if you publish a new version of the
schedule and run the script again, you'll get a new directory structure. The
video files still have arbitrary names, chosen by the uploader/reviewer; we want
a common naming scheme instead. The get_filepath.py script creates such a name,
one that sorts chronologically and contains all the information the OBS
operators need. The current scheme is
<room>/day<day-of-the-conference>/day<day-of-the-conference>_<day-of-the-week>_<date>_<time>_<pretalx-id>_<title>.mkv.
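
To illustrate, a made-up example of what such a path could look like
(hypothetical room, date and pretalx ID):

room1/day2/day2_tuesday_2021-06-08_14:30_ABC123_some-talk-title.mkv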

SEAFILE

The directories don't contain only the single final video, but also the metadata
and perhaps the original video or a presentation. The file we actually copy is
the *.mkv file which was modified last, which will be the cut video. The
get_files_to_copy.sh script creates a list of the files that should be copied;
it only lists the files that weren't copied yet (based on the filename). The
copy_files.sh script does the actual copying and is rather generic; it only
depends on a file list and Seafile.
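
Conceptually, picking "the *.mkv that was modified last" boils down to something
like this (a local-filesystem sketch; the real script works against the Seafile
API):

# Newest .mkv in the current talk directory
ls -t -- *.mkv | head -n 1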


EASILY DOWNLOADABLE FILES

Seafile has a feature to download a full directory as a zip file. I originally
planned to use that. It turns out that the size of the files can be too large; I
got the error message Unable to download directory "day1": size is too large. So
I needed to provide another tool, as I didn't want people to have to click
through and download every individual talk.

Access to the files should be as easy as possible, i.e. the operators that need
the files shouldn't need a Seafile account. As the videos also shouldn't be
public, the compromise was a download link secured with a password. This means
that an authentication step is needed, which isn't trivial. The
download_files.sh script does the login and then downloads all the files in a
directory. For simplicity, it doesn't work recursively, which means it needs to
be run once for each day directory.

I also added a checksum check for more robustness. I created those checksums
manually by running b2sum * > B2SUMS in each of the directories and then
uploaded them to Seafile.
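
b2sum can later verify the downloaded files against that list:

# Run inside the downloaded directory; compares every file against B2SUMS
b2sum -c B2SUMS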


LIST OF LIVE/PRE-RECORDED TALKS

Some talks are pre-recorded and some are live. The list_recorded_talks.py script
creates a Markdown file that contains a schedule with that information,
including the lengths of the talks if they are pre-recorded. This is useful for
the moderators to know how much time there will be for questions. At the FOSSGIS
we have 5 minutes for questions, but if the talk runs longer, there will be less
time.

You need the schedule and the lengths of the recorded talks. This time I haven't
fully automated the process; it's a bit more manual than the other steps. All
scripts can be found in the list_recorded_talks directory.

Get the schedule:

curl https://pretalx.com/<your-conference>/schedule.json > schedule.json


For getting the lengths of the videos, download them all with the download
script from the Easily downloadable files section above. Then run the
get_lengths.sh script in each of the directories and write the output into a
file. For example:

cd your-talks-day1
/path/to/get_lengths.sh > ../lengths/day1.txt


Then combine the lengths of all days into a single file:

cat ../lengths/*.txt > ../talk_lengths.txt


Now you can create the final schedule:

cd ..
python3 /path/to/list_recorded_talks.py schedule.json talk_lengths.txt


Here’s a sample schedule from the FOSSGIS 2021.


MISCELLANEOUS SCRIPTS

SPEAKER NOTIFICATION

The speakers didn't get feedback on whether their video was correctly
uploaded/processed (other than seeing a successful upload in Seafile). A short
time before the conference, we sent out the latest information that speakers
need to know. We decided to take the chance to also include whether their video
upload was successful or not, so that they could contact us in case something
with the upload didn't go as they expected (there weren't any issues :).

It is very similar to sending out the emails with the upload links. You get the
information about the speakers and talks in the same way. The only difference is
that we now also need to know whether the talk was pre-recorded or not. We get
that from Seafile:

curl --silent -X GET --header 'Authorization: Token <seafile-token>' 'https://seafile.example.org/api2/repos/<repo-id>/?p=/<dir-with-talks>&t=d'|jq --raw-output '.[].name' > prerecorded_talks.txt


The full script to create the emails can be found at email_speaker_final.sh. In
order to post them to pretalx, you can use the email_to_pretalx.sh script and
follow the description in the creating emails section.

NUMBER OF UPLOADS

It could happen that people upload a new version of a talk. The current scripts
won't recognize that if a previous version was already reviewed. Hence, I
manually checked for directories with more than one file in them.
This can easily be done with a single curl command to the Seafile HTTP API:

curl --silent -X GET --header 'Authorization: Token <seafile-token>' 'https://seafile.example.org/api2/repos/<repo-id>/dir/?p=/<dir-with-talks>&t=f&recursive=1'|jq --raw-output '.[].parent_dir'|sort|uniq -c|sort


The output is sorted by the number of files in that directory:

  1 /talks_conference/ZVAZQQ
  1 /talks_conference/DXCNKG
  2 /talks_conference/H7TWNG
  2 /talks_conference/M1PR79
  2 /talks_conference/QW9KTH
  3 /talks_conference/VMM8MX


NORMALIZE VOLUME LEVEL

If the volume of the talk was too low, it was normalized. I used
ffmpeg-normalize for it:

ffmpeg_normalize --audio-codec aac --progress talk.mkv



CONCLUSION

Doing all this with scripts was a good idea; the less manual work the better. It
also enabled me to process talks even during the conference in a semi-automated
way. I created lots of small scripts and sometimes used just a subset of them,
e.g. the copy_files.sh script, or quickly modified them to deal with a special
case. For example, all lightning talks of a single slot (2-4 talks) were merged
into one video file. That file then of course isn't associated with a single
pretalx ID any more.

During the conference, the volume levels of the pre-recorded talks varied a lot.
For next time I'd like to do some automated audio level normalization after
people have uploaded their files. It should be done before the reviewers have a
look, so that they can report it in case the normalization broke the audio.

Some speakers were confused about whether the upload had really worked. Seafile
doesn't have an "upload now" button or similar; it does its JavaScript magic
once you've selected a file. That's convenient, but it also confused me when I
used it for the first time. And if you reload the page, you won't see that
something was already uploaded. So perhaps an automated "we received your
upload" email to speakers could be added as well.

Overall I'm really happy with how the whole process went; there weren't any
major failures like lost videos. I also haven't heard any complaints from the
people who needed to use the videos at any stage of the pipeline. I'd also like
to thank all the speakers who uploaded a pre-recorded video; it really helped a
lot in running the FOSSGIS conference as smoothly as it was run.

No comments

Categories: en, conference, geo


WEBASSEMBLY MULTI-VALUE RETURN IN TODAY'S RUST WITHOUT WASM-BINDGEN

2021-01-29 15:00

The goal was to run some WebAssembly within different host languages. I needed a
WASM file that is independent of the host language, hence I decided to code the
FFI manually, without using any tooling like wasm-bindgen, which is JavaScript
specific. It needed a bit of custom tooling, but in the end I succeeded in
having a WASM binary that has a multi-value return, generated with today's Rust
compiler, without using wasm-bindgen annotations.


INTRODUCTION

In my case I wanted to pass some bytes into the WASM module, do some processing
and return some other bytes. I found all the information I needed in the
excellent A practical guide to WebAssembly memory by radu. There he mentions the
WebAssembly multi-value proposal and links to a blog post from 2019 called
Multi-Value All The Wasm! which explains its implementation for the Rust
ecosystem.

As it's from 2019 I just went ahead and assumed I could use multi-value returns
in Rust.


THE JOURNEY

My function signature for the FFI looks like this:

pub extern "C" fn decode(data_ptr: *const u8, data_len: usize) -> (*const u8, usize) { … }


When I compiled it, I got this warning:

warning: `extern` fn uses type `(*const u8, usize)`, which is not FFI-safe
 --> src/lib.rs:2:67
  |
2 | pub extern "C" fn decode(data_ptr: *const u8, data_len: usize) -> (*const u8, usize) {
  |                                                                   ^^^^^^^^^^^^^^^^^^ not FFI-safe
  |
  = note: `#[warn(improper_ctypes_definitions)]` on by default
  = help: consider using a struct instead
  = note: tuples have unspecified layout


Multi-value returns are certainly not meant for C APIs, but for WASM it might
still work, I thought. Running wasm2wat shows:

(module
  (type (;0;) (func (param i32 i32 i32)))
  (func $decode (type 0) (param i32 i32 i32)
…


This clearly isn't a multi-value return. It doesn't even have a return value at
all; it takes 3 parameters instead of the 2 the function definition has. I found
an issue called Multi value Wasm compilation #73755 and was puzzled why it
didn't work. Is this a regression? Why did it work in that blog post from 2019?
I gave the Multi-Value All The Wasm! blog post another read, and it turns out it
explains all this in detail (look at the wasm-bindgen section). Back then it
wasn't supported by the Rust compiler directly, but by wasm-bindgen.

So perhaps I could just use the wasm-bindgen command line tool and transform my
compiled WASM binary into a multi-value return one. There is an environment
variable called WASM_BINDGEN_MULTI_VALUE=1 to enable that transformation. Sadly
that doesn't really work, as it needs some interface types present in the WASM
binary (which I don't have).

Thanks to open source, the blog post about the implementation of the
transformation feature and some trial and error, I was able to extract the
pieces I needed and created a tool called wasm-multi-value-reverse-polyfill. I
didn't need to do any of the hard parts, just some wiring up. I was now able to
transform my WASM binary into a multi-value return one simply by running:

$ multi-value-reverse-polyfill ./target/wasm32-unknown-unknown/release/wasm_multi_value_retun_in_rust.wasm 'decode i32 i32'
Make `decode` function return `[I32, I32]`.


The WAT disassembly now looks like that:

  (type (;0;) (func (param i32 i32) (result i32 i32)))
  (type (;1;) (func (param i32 i32 i32)))
  (func $decode_multivalue_shim (type 0) (param i32 i32) (result i32 i32)
…


There you go. There is now a shim function that has the multi-value return,
which calls the original method. I can now use my newly created WASM binary with
WebAssembly runtimes that support multi-value returns (like Wasmer or Node.js).


CONCLUSION

With wasm-multi-value-reverse-polyfill I'm now able to create multi-value return
functions with the current Rust compiler without depending on all the magic
wasm-bindgen is doing.

No comments

Categories: en, WASM, Rust


WHEN NPM LINK FAILS

2019-08-01 22:35

There are cases where linking local packages doesn't produce the same result as
if you had installed all packages from the registry. Here I'd like to tell the
story of one of those real-world cases and conclude with a solution to the
problem.


THE PROBLEM

When you do an npm install, heavy module deduplication and hoisting happens, and
it doesn't always behave the same way in all cases. For example, if you npm link
a package, the resulting node_modules tree is different. This may lead to
unexpected runtime errors.

It happened to me recently and I thought I'd use exactly this real-world example
to illustrate the problem and a possible solution to it.


REAL WORLD EXAMPLE

PREPARATIONS

Start with cloning the js-ipfs-mfs and js-ipfs-unixfs-importer repository:

$ git clone https://github.com/ipfs/js-ipfs-mfs --branch v0.12.0 --depth 1
$ git clone https://github.com/ipfs/js-ipfs-unixfs-importer --branch v0.39.11 --depth 1


Our main module is js-ipfs-mfs, and let's say you want to make local changes to
js-ipfs-unixfs-importer, which is a direct dependency of js-ipfs-mfs.

First of all, you of course make sure that the tests currently pass (we just run
a subset to get to the actual issue faster). I'm sorry that the installation
takes so long and uses so much space; the dev dependencies are quite heavy.

$ cd js-ipfs-mfs
$ npm install
$ npx mocha test/write.spec.js
…
  53 passing (4s)
  1 pending


Ok, all tests passed.

REPRODUCING THE ISSUE

Before we even start modifying js-ipfs-unixfs-importer, we link it and check
that the tests still pass.

$ cd js-ipfs-unixfs-importer
$ npm link
$ cd ../js-ipfs-mfs
$ npm link ipfs-unixfs-importer
$ npx mocha test/write.spec.js
…
  37 passing (2s)
  1 pending
  16 failing
…


Oh, no. The tests failed. But why? The reason is deep down in the code. The root
cause is in the hamt-sharding module and it's not even a bug. It just checks if
something is a Bucket:
static isBucket (o) {
  return o instanceof Bucket
}


instanceof only works if the instance and the class it is checked against come
from the exact same loaded copy of the module. Let's see who is importing the
hamt-sharding module:

$ npm ls hamt-sharding
ipfs-mfs@0.12.0 /home/vmx/misc/protocollabs/blog/when-npm-link-fails/js-ipfs-mfs
├── hamt-sharding@0.0.2
├─┬ ipfs-unixfs-exporter@0.37.7
│ └── hamt-sharding@0.0.2  deduped
└─┬ UNMET DEPENDENCY ipfs-unixfs-importer@0.39.11
  └── hamt-sharding@0.0.2  deduped

npm ERR! missing: ipfs-unixfs-importer@0.39.11, required by ipfs-mfs@0.12.0


Here we see that ipfs-mfs has a direct dependency on it, and an indirect
dependency through ipfs-unixfs-exporter and ipfs-unixfs-importer. All of them
use the same version (0.0.2), hence it's deduped and the instanceof call should
work. But there's also an error about an UNMET DEPENDENCY, the
ipfs-unixfs-importer module we linked to.

To make it clear what's happening inside Node.js: when you
require('hamt-sharding') from the ipfs-mfs code base, it will load the module
from the physical location js-ipfs-mfs/node_modules/hamt-sharding. When you
require it from ipfs-unixfs-importer, it will be loaded from
js-ipfs-mfs/node_modules/ipfs-unixfs-importer/node_modules/hamt-sharding, or
rather from ipfs-unixfs-importer/node_modules/hamt-sharding, as
js-ipfs-mfs/node_modules/ipfs-unixfs-importer is just a symlink to a symlink to
that directory.

When you do a normal installation without linking, you won't have this issue as
hamt-sharding will be properly deduplicated and only loaded once from
js-ipfs-mfs/node_modules/hamt-sharding.
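
You can see this for yourself by asking Node.js which physical copy it would
load from each package (a quick check using the example layout from above):

# From the main package: resolves to js-ipfs-mfs/node_modules/hamt-sharding
cd js-ipfs-mfs
node -p "require.resolve('hamt-sharding')"

# From the linked package: resolves to its own node_modules/hamt-sharding
cd ../js-ipfs-unixfs-importer
node -p "require.resolve('hamt-sharding')"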


POSSIBLE WORKAROUNDS THAT DO NOT WORK

Though you'd still like to change ipfs-unixfs-importer locally and test those
changes with ipfs-mfs without breaking anything. I had several ideas on how to
work around this. I'll start with the ones that didn't work:

 1. Just delete the js-ipfs-unixfs-importer/node_modules/hamt-sharding
    directory. The module should still be found in the resolve paths of
    ipfs-mfs. No, it isn't. Tests fail because hamt-sharding can't be found.
 2. Global linking runs an npm install when you run the initial npm link. What
    if we remove js-ipfs-unixfs-importer/node_modules completely and symlink to
    the module manually? That also doesn't work; the hamt-sharding module can't
    be found either.
 3. Install ipfs-unixfs-importer directly with a relative path (npm install
    ../js-ipfs-unixfs-importer). No, that doesn't work either; it will still
    have its own node_modules/hamt-sharding, so it won't be properly
    deduplicated.

There must be a way to make local changes to a module and test them without
publishing the module each time. Luckily there really is.


WORKING WORKAROUND

I'd like to thank my colleague Hugo Dias for this workaround that he has been
using for a while already.

You can just replicate what a normal npm install <package> would do: you pack
the module and then install that packed package. In our case that means:

$ cd js-ipfs-mfs
$ npm pack ../js-ipfs-unixfs-importer
…
ipfs-unixfs-importer-0.39.11.tgz
$ npm install ipfs-unixfs-importer-0.39.11.tgz
…
+ ipfs-unixfs-importer@0.39.11
added 59 packages from 76 contributors and updated 1 package in 31.698


Now all tests pass.

This is quite a manual process. Luckily Hugo created a module to automate
exactly that workflow. It's called connect-deps.


CONCLUSION

Sometimes linking packages doesn't create the same structure of modules and you
need to use packing instead. To automate this you can use connect-deps.

No comments

Categories: en, JavaScript, npm


SHOW YOUR OWN STRIPES

2019-06-20 22:35

You want to create #ShowYourStripes for the location you live in? Here's how.


INTRO



When I first saw #ShowYourStripes I immediately fell in love (thanks to Stefan
Münz for tweeting about it). I think it's a great and simple visualization by Ed
Hawkins of what we are currently facing when it comes to climate change. You
don't need to scroll through long tables or figure out the axes of some diagram.
You can simply see that something is changing massively.

After playing around a bit with the cities available on the #ShowYourStripes
website I wanted to do the same for the city I live in, Augsburg, Germany. I
looked at the website's source code first, in the hope that it dynamically
creates the graphics from some JSON or so. That isn't the case. I then searched
Twitter, GitHub and the web to see if I could find any related open source
project. I didn't want to spend time figuring out the parameters that were used
to create the originals; after all, I wanted mine to look exactly like those.

Luckily I found a Tweet from Zeke Hausfather saying that he could create those.
I then asked him if he could please release the source code. And just 7h later
he did.


CREATING YOUR OWN STRIPES

Now it's time for a quick tutorial on how you can create your own
#ShowYourStripes with that source code.

PREREQUISITES

I did those steps on a Debian system that had the most common tools installed
(like Python 3 or wget). I'm using Pipenv for installing the required Python
packages, but you can use any other package management tool for Python.

Let's get the data file with the global temperature values first. It's 200MB so
it might take a while.

wget http://berkeleyearth.lbl.gov/auto/Global/Gridded/Complete_TAVG_LatLong1.nc


Now retrieve the source code:

$ wget https://raw.githubusercontent.com/hausfath/scrape_global_temps/master/City%20Warming%20Strips%20.ipynb -O showyourstripes.ipynb


In order to run the script, we need to get a few Python packages first:

$ pipenv install matplotlib nbconvert netcdf4 numpy_indexed pandas



RUNNING THE SCRIPT

The original script is a Jupyter Notebook, so we convert it to a plain Python
script (you can ignore the warnings):

$ pipenv run jupyter-nbconvert --to python showyourstripes.ipynb


Next we need to make some changes to the showyourstripes.py file so that it
works on your machine and plots the stripes for your location. We work in the
current directory, so you can comment out the line that changes the directory:

#os.chdir('/Users/hausfath/Desktop/Climate Science/GHCN Monthly/')


The other change we need is the location the stripes should be plotted for.
Here I use the values for Augsburg, Germany; use your own values there. When I
don't know the coordinates of a location, I usually check Wikipedia. In the top
right corner of an article you can find the coordinates of a place (if it has
them attached). If you click on them you get to the GeoHack page of the article,
where on the top right you can find the coordinates in decimals in lat/lon
order. In my case it's "48.366667, 10.9".

savename = 'augsburg'

lat = 48.366667
lon = 10.9


Now you're ready to run the script:

pipenv run python showyourstripes.py


Now you should have an output file called augsburg.png in the same directory,
containing the stripes.


CONCLUSION

Have fun creating your own #ShowYourStripes. Thanks again to Zeke Hausfather for
making and publishing the source code so quickly.

No comments

Categories: en, climatechange, tutorial


EU COPYRIGHT REFORM FOLLOW-UP

2019-04-23 22:35

The EU copyright reform has been finally adopted as of April 15th.
Unfortunately, the contentious parts, such as the changes to ancillary copyright
for press publishers and the looming upload filters, could not be prevented.

First, a bit of background on the copyright reform. The reform contains several
changes, by no means all of them bad; a good overview of the topic is this blog
post by Julia Reda. Before the vote there was also, in my opinion, a very good
guest article by Dorothee Bär in the Main Post that addresses the negative
effects of the reform.

In addition, there was the warning from the UN Special Rapporteur on freedom of
expression, David Kaye, that the reform will lead to restrictions on freedom of
expression.

Just one day after the EU Parliament's approval on March 26th, the French
culture minister Franck Riester announced that France wants to invest in
filtering technology. So despite Germany's protocol statement, upload filters
will most likely come.

What really bothered me throughout the whole debate was the ignorance of many of
those involved. I made mistakes too, but corrected them immediately. It really
helps to also listen to the arguments of the other side. Especially during the
debate in the EU Parliament right before the vote (available directly as video),
it became clear how many MEPs hadn't really understood what exactly it is about,
or what consequences the reform will have. There were even crude attacks that
had nothing to do with the actual matter. It apparently didn't get through to
the supporters that the opponents, like me, also want a copyright reform; it's
merely about how it is implemented. Julia Reda's speech was (as so often)
excellent. She briefly summarizes the facts once more and also describes the
frustration of the participants of the mass protests. On the possible resulting
political disenchantment there is also a very good commentary on tagesschau.de.

Another interesting question is who the actual winners of this reform are. The
supporters always presented the creators as the winners of the reform. However,
those were always creators who are represented by collecting societies. It was
ignored that, especially in the internet age, there are many other ways of being
a creator. There are two nice stories about this: one is the attempt, as a
private person, to be represented by a collecting society for photos, the other
as a video producer on YouTube. Neither is currently possible.

I also took part in calling the Brussels offices of the MEPs. The feedback
varied a lot. It ranged from a friendly "I will pass this on", to "the MEP will
vote like her colleague, against the majority of her group and of the national
delegation", to "so many people are calling that I won't talk to you, but the
MEP will form a well-founded opinion" (who in the end didn't vote at all).

Finally, I would like to thank everyone who fought so hard against the reform
and organized demos, and of course also the numerous participants of those
demos.

No comments

Categories: de, EU, politics, copyright


WHY I AM AGAINST THE EU COPYRIGHT DIRECTIVE

2019-03-17 22:35

Update 2019-03-19: The argumentation below is wrong. A forum won't be considered
an "online content sharing service provider" according to the definition in
Article 2 (5) (page 51 of the full text of the final version). I'm sorry for
this misinformation. I'm keeping the text below for reference so that others can
see what I got wrong.

There are many arguments against the EU Copyright Directive (more correctly, the
Directive on Copyright in the Digital Single Market); some I agree with, some I
don't. Hence, here's my take on why I think the directive should be stopped. The
short version is: it strengthens the big platforms and weakens/destroys the
small ones.

My hope is that this blog post will get more people interested in the topic and
hopefully make you join the Europe-wide protests on Saturday, March 23rd 2019.
If you want to join, there's an interactive map of all known protests created by
the folks from stopACTA2.


INTRO

It might seem confusing that platforms like YouTube are against the directive;
it sounds like they have a lot to lose, hence they try everything they can
against it. For me, that is normally a sign that such a directive does exactly
what it should. But in this case, it's not. YouTube surely has its own reasons
for being against it. What is more important to me is that if the directive is
approved by the European Parliament, small platforms will have almost no chance
to survive.


WHY SMALL PLATFORMS WILL DIE

There are exceptions in the directive for some platforms. You can find them in
the full text of the final version in paragraph (38b), page 36. But those
exceptions won't help all smaller platforms. For example, a discussion board
which is older than 3 years and runs advertising to cover the server costs
wouldn't be exempt. It would be liable for every copyright infringement.

An infringement could be something as small as a profile picture; let's say
yours is Luke Skywalker. The platform could block custom profile pictures, but
even that won't be enough. Someone could post some copyrighted text. But how
would you run a discussion board without users being able to post text? So the
only way not to be liable would be to check for all infringements (how would you
do that?), or to close the platform.


OUTRO

I intentionally tried to keep this short and highlight the issue that matters
most to me. Of course there are a lot more issues regarding the EU Copyright
Directive, so if you want to know more, go to websites like savetheinternet.info,
stopACTA2 or the website of Julia Reda, a Member of the European Parliament who
puts lots of effort into explaining and spreading the word on why the directive
should be stopped (also follow her on Twitter). Thanks a lot, Julia, for doing
such amazing work!

No comments

Categories: en, EU, politics, copyright


JOINING PROTOCOL LABS

2018-01-24 22:35

I'm pumped to announce that I'm joining Protocol Labs as a software engineer.
Those following me on Twitter or looking at my GitHub activity might have
already gotten some hints.


SHORT TERM

My main focus is currently on IPLD (InterPlanetary Linked Data). I'll smooth
things out and also work on the IPLD specs, mostly on IPLD Selectors. Those IPLD
Selectors will be used to make the underlying graph more efficient to traverse
(especially for IPFS). That's a lot of buzzwords; I hope it will get clearer the
more I blog about this.

To get started I've done the JavaScript IPLD implementations for Bitcoin and
Zcash. Those are the basis for making easy traversal of the Bitcoin and Zcash
blockchains possible.


LONGER TERM

In the longer term I'll be responsible for bringing IPLD to Rust. That's
especially exciting with Rust's new WebAssembly backend. You'll get a
high-performance Rust implementation, but also one that works in browsers.


WHAT ABOUT NOISE?

Many of you probably know that I've been working full-time on Noise for the past
1.5 years. It is shaping up nicely and is already quite usable. Of course I
don't want to see this project vanish, and it won't. At the moment I only work
part-time at Protocol Labs, to also have some time for Noise. In addition to
that, there's also interest within Protocol Labs to use Noise (or parts of it)
for better query capabilities. So far these are only rough ideas, which I
mentioned briefly at the end of my talk about Noise at the Lisbon IPFS Meetup
two weeks ago. But what's the distributed web without search?


WHAT ABOUT GEO?

I'm also part of the OSGeo community and the FOSS4G movement. So what's the
future there? I see a lot of potential in the Sneakernet. If geo-processing
workflows are based around IPFS, you could use the same tools/scripts whether
the data is stored somewhere in the cloud, or access your local mirror/dump if
your internet connection isn't that fast/reliable.

I expect non-reliable connectivity to be a hot topic at the FOSS4G 2018
conference in Dar es Salaam, Tanzania.


CONCLUSION

I'm super excited. It's a great team and I'm looking forward to pushing the
distributed web a bit forward.

No comments

Categories: en, ProtocolLabs, IPLD, IPFS, JavaScript, Rust, geo


INTRODUCTION TO NOISE’S NODE.JS API

2017-12-21 22:35

In the previous blog post about Noise we imported data with the help of some
already prepared scripts. This time it's an introduction to using Noise's
Promise-based Node.js API directly yourself.

The dataset we use is not a single ready-to-use file, but one that consists of
several files. The data is the "Realized Cost Savings and Avoidance" for US
government agencies. I'm really excited that such data gets openly published as
JSON; I wish Germany were that advanced in this regard. If you want to know more
about the structure of the data, there's documentation about the JSON Schema;
they even have an "OFCIO JSON User Guide for Realized Cost Savings" on how to
produce the data out of Excel.

I've prepared a repository containing the final code and the data. But feel free
to follow along with this tutorial yourself and just point to the data directory
of that repository when running the script.

Let’s start with the boilerplate code for reading in those files and parsing
them as JSON. But first create a new package:

mkdir noise-cost-savings
cd noise-cost-savings
npm init --force


You can use --force here as you probably won't publish this package anyway. Put
the boilerplate code below into a file called index.js. Please note that the
code is kept as simple as possible; for a real-world application you surely want
better error handling.

#!/usr/bin/env node
'use strict';

const fs = require('fs');
const path = require('path');

// The only command line argument is the directory where the data files are
const inputDir = process.argv[2];
console.log(`Loading data from ${inputDir}`);

fs.readdir(inputDir, (_err, files) => {
  files.forEach(file => {
    fs.readFile(path.join(inputDir, file), (_err, data) => {
      console.log(file);
      const json = JSON.parse(data);
      processFile(json);
    });
  });
});

const processFile = (data) => {
  // This is where our actual code goes
};


This code should already run. Check out my repository with the data into some
directory first:

git clone https://github.com/vmx/blog-introduction-to-noises-nodejs-api


Now run the script from above as:

node index.js <path-to-directory-from-my–repo-mentioned-above>/data


Before we take a closer look at the data, let’s install the Noise module. Please
note that you need to have Rust installed (easiest is probably through rustup)
before you can install Noise.

npm install noise-search


This will take a while. So let’s get back to code. Load the noise-search module
by adding:

const noise = require('noise-search');


A Noise index needs to be opened and closed properly, else your script will hang
and not terminate. Opening a new Noise index is easy. Just put this before
reading the files:

const index = noise.open('costsavings', true);


It means: open an index called costsavings and create it if it doesn't exist yet
(that's the boolean true). Closing the index is more difficult due to the
asynchronous nature of the code. We can close the index only after all the
processing is done. Hence we wrap the fs.readFile(…) call in a Promise. The new
code looks like this:

fs.readdir(inputDir, (_err, files) => {
  const promises = files.map(file => {
    return new Promise((resolve, reject) => {
      fs.readFile(path.join(inputDir, file), (err, data) => {
        if (err) {
          reject(err);
          throw err;
        }

        console.log(file);
        const json = JSON.parse(data);
        resolve(processFile(json));
      });
    });
  });
  Promise.all(promises).then(() => {
    console.log("Done.");
    index.close();
  });
});


If you run the script now, it should print out the file names as before and
terminate with a Done.. A directory called costsavings got created when you ran
the script; this is where the Noise index is stored.

Now let's have a look at the data files, e.g. the cost savings file from the
Department of Commerce (or the JSON Schema). You'll see that it has a single
field called "strategies", which contains an array with all strategies. We are
free to pre-process the data as much as we want before we insert it into Noise.
So let's create a separate document for every strategy. Our processFile()
function now looks like this:

const processFile = (data) => {
  data.strategies.forEach(async strategy => {
    // Use auto-generated Ids for the documents
    await index.add(strategy);
  });
};


Now all the strategies get inserted. Make sure you delete the index (the
costsavings directory) if you re-run the script, else you'll end up with
duplicated entries, as different Ids are generated on every run.

To query the index you could use the Noise indexserve script that I’ve also used
in the last blog post about Noise. Or we just add a small query at the end of
the script after the loading is done. Our query function will do the query and
output the result:

const queryNoise = async (query) => {
  const results = await index.query(query);
  for (const result of results) {
    console.log(result);
  }
};


There's not much to say, except that it's again a Promise-based API. Now hook up
this function after the loading and before the index is closed. For that,
replace the Promise.all(…) call with:

Promise.all(promises).then(async () => {
  await queryNoise('find {} return count()');
  console.log("Done.");
  index.close();
});


It's a really simple query; it just returns the number of documents that are in
there (644). After all this hard work, it's time to make a more complicated
query on this dataset to show that it was worth doing all this. Let's return the
total net savings of all agencies in 2017. Replace the query find {} return
count() with:

find {fy2017: {netOrGross: == "Net"}} return sum(.fy2017.amount)


That’s $845m savings. Not bad at all!

You can learn more about the Noise Node.js API from the README at the
corresponding repository. If you want to learn more about possible queries, have
a look at the Noise Query Language reference.

Happy cost saving!

No comments

Categories: en, Noise, Node, JavaScript, Rust


EXPLORING DATA WITH NOISE

2017-12-12 22:35

This is a quick introduction on how to explore some JSON data with Noise. We
won't do any pre-processing, but just load the data into Noise and see what we
can do with it. Sometimes the JSON you get needs some tweaking before further
analysis makes sense, for example because you want to rename fields or because
numbers are stored as strings. This exploration phase can be used to get a
feeling for the data and for which parts might need some adjustments.

Finding decent ready-to-use data that contains some nicely structured JSON was
harder than I thought. Most datasets are either GeoJSON or CSV masquerading as
JSON. But I was lucky and found a JSON dump of the CVE database provided by
CIRCL. So we'll dig into the CVE (Common Vulnerabilities and Exposures) database
to find out more about all those security vulnerabilities.

Noise has a Node.js binding to get started easily. I won't dig into the API for
now. Instead I've prepared two scripts: one to load the data from a file
containing newline-separated JSON, and another one for serving up the Noise
index over HTTP, so that we can explore the data via curl.

PREREQUISITES

As we use the Node.js binding for Noise, you need to have Node.js, npm and Rust
(easiest is probably through rustup) installed.

I've created a repository with the two scripts mentioned above plus a subset of
the CIRCL CVE dataset. Feel free to download the full dataset from the CIRCL
Open Data page (1.2G unpacked) and load it into Noise. Please note that Noise
isn't performance-optimised at all yet, so the import takes some time, as the
hard work of all the indexing is done at insertion time.

git clone https://github.com/vmx/blog-exploring-data-with-noise
cd blog-exploring-data-with-noise
npm install


Now everything we need should be installed. Let's load the data into Noise and
do a query to verify it works properly.


LOADING THE DATA AND VERIFY INSTALLATION

Loading the data is as easy as:

npx dataload circl-cve.json


For every inserted record one dot will be printed.

To spin up the simple HTTP server, just run:

npx indexserve circl-cve


To verify it does actually respond to queries, try:

curl -X POST http://127.0.0.1:3000/query -d 'find {} return count()'


If all documents got inserted correctly it should return

[
1000
]


Everything is set up properly; now it's time to actually explore the data.


EXPLORING THE DATA

We don't have a clue yet what the data looks like, so let's start by looking at
a single document:

curl -X POST http://127.0.0.1:3000/query -d 'find {} return . limit 1'
[
{
  "Modified": "2017-01-02 17:59:00.147000",
  "Published": "2017-01-02 17:59:00.133000",
  "_id": "34de83b0d3c547c089635c3a8b4960f2",
  "cvss": null,
  "cwe": "Unknown",
  "id": "CVE-2017-5005",
  "last-modified": {
    "$date": 1483379940147
  },
  "references": [
    "https://github.com/payatu/QuickHeal",
    "https://www.youtube.com/watch?v=h9LOsv4XE00"
  ],
  "summary": "Stack-based buffer overflow in Quick Heal Internet Security 10.1.0.316 and earlier, Total Security 10.1.0.316 and earlier, and AntiVirus Pro 10.1.0.316 and earlier on OS X allows remote attackers to execute arbitrary code via a crafted LC_UNIXTHREAD.cmdsize field in a Mach-O file that is mishandled during a Security Scan (aka Custom Scan) operation.",
  "vulnerable_configuration": [],
  "vulnerable_configuration_cpe_2_2": []
}
]


The query above means: "Find all documents without restrictions and return their
full contents. Limit it to a single result".

You don't always want to return all documents, but filter based on certain
conditions. Let's start with the word match operator ~=. It matches documents
which contain the given words in a specific field, in our case "summary". As
"buffer overflow" is a common attack vector, let's search for all documents that
contain it in the summary.

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"}'
[
"34de83b0d3c547c089635c3a8b4960f2",
"8dff5ea0e5594e498112abf1c222d653",
"741cfaa4b7ae43909d1da153747975c9",
…
"b7419042c9464a7b96d3df74451cb4a7",
"d379e9fda704446982cee8638f32e72b"
]


That's quite a long list of random characters. Noise assigns an Id to every
inserted document that doesn't contain a "_id" field. By default Noise returns
those Ids of the matching documents, so no return value is equivalent to return
._id. Let's return the CVE numbers of the matching vulnerabilities instead; that
field is called "id":

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return .id'
[
"CVE-2017-5005",
"CVE-2016-9942",
…
"CVE-2015-2710",
"CVE-2015-2666"
]


If you want to know how many there are, just append a return count() to the
query:

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return count()'
[
61
]


Or we can of course return the full documents to see if there are further
interesting things to look at:

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return .'
…


I won't post the output here; it's way too much. If you scroll through the
output, you'll see that some documents contain a field named "capec", which is
probably about the Common Attack Pattern Enumeration and Classification. Let's
have a closer look at one of those, e.g. from "CVE-2015-8388":

curl -X POST http://127.0.0.1:3000/query -d 'find {id: == "CVE-2015-8388"} return .capec'
[
[
  {
    "id": "15",
    "name": "Command Delimiters",
    "prerequisites": …
    "related_weakness": [
      "146",
      "77",
      …
    ],
    "solutions": …
    "summary": …
  },
  …


This time we've used the exact match operator ==. As the CVEs have a unique Id,
it only returned a single document. It's again a lot of data; we might only care
about the CAPEC names, so let's return those:

curl -X POST http://127.0.0.1:3000/query -d 'find {id: == "CVE-2015-8388"} return .capec[].name'
[
[
  "Command Delimiters",
  "Flash Parameter Injection",
  "Argument Injection",
  "Using Slashes in Alternate Encoding"
]
]


Note that it is an array within an array. The reason is that in this case we
only return the CAPEC names of a single document, but our filter condition could
of course match more documents, like the word match operator did when we were
searching for "buffer overflow".

Let's find all CVEs with the CAPEC name "Command Delimiters".

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: [{name: == "Command Delimiters"}]} return .id'
[
"CVE-2015-8389",
"CVE-2015-8388",
"CVE-2015-4244",
"CVE-2015-4224",
"CVE-2015-2265",
"CVE-2015-1986",
"CVE-2015-1949",
"CVE-2015-1938"
]


The CAPEC data also contains references to related weaknesses as we’ve seen
before. Let’s return the related_weakness of all CVEs that have the CAPEC name
“Command Delimiters”.

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: [{name: == "Command Delimiters"}]} return {cve: .id, related: .capec[].related_weakness}'
[
{
  "cve": "CVE-2015-8389",
  "related": [
    [
      "146",
      "77",
      …
    ],
    [
      "184",
      "185",
      "697"
    ],
    …
  ]
},
{
  "cve": "CVE-2015-8388",
  "related": [
  …
  ]
},
…
]


That's not really what we were after. This returns the related weaknesses of all
CAPECs and not just the one named "Command Delimiters". The solution is a
so-called bind variable. You can store an array element that matches a condition
in a variable, which can then be re-used in the return value.

Just prefix the array condition with a variable name separated by two colons:

find {capec: commdelim::[{name: == "Command Delimiters"}]}


And use it in the return value like any other path:

return {cve: .id, related: commdelim.related_weakness}


So the full query is:

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: commdelim::[{name: == "Command Delimiters"}]} return {cve: .id, related: commdelim.related_weakness}'
[
{
  "cve": "CVE-2015-8389",
  "related": [
    [
      "146",
      "77",
      …
    ]
  ]
},
{
  "cve": "CVE-2015-8388",
  "related": [
    [
      "146",
      "77",
      …
    ]
  ]
},
…
]


The result isn't that exciting, as it's the same related weaknesses for all
CVEs, but of course they could be completely arbitrary; there's no limitation on
the schema.

So far we haven't done any range queries yet. So let's have a look at all CVEs
that were last modified on December 28th and have a "High" severity rating
according to the Common Vulnerability Scoring System. First we need to determine
the correct timestamps:

date --utc --date="2016-12-28" "+%s"
1482883200
date --utc --date="2016-12-29" "+%s"
1482969600


Please note that the "last-modified" field has timestamps with 13 characters
(ours have 10), which means that they are in milliseconds, so we just append
three zeros and we're good. The severity rating is stored in the field "cvss";
"High" severity means a value from 7.0 to 8.9. We need to put the field name
last-modified in quotes as it contains a dash (just as you'd do in JavaScript).
The final query is:

curl -X POST http://127.0.0.1:3000/query -d 'find {"last-modified": {$date: >= 1482883200000, $date: < 1482969600000}, cvss: >= 7.0, cvss: <=8.9} return .id'
[
"CVE-2015-4199",
"CVE-2015-4200",
"CVE-2015-4224",
"CVE-2015-4227",
"CVE-2015-4230",
"CVE-2015-4234",
"CVE-2015-4208",
"CVE-2015-4526"
]


This was an introduction to basic querying with Noise. If you want to know about
further capabilities, have a look at the Noise Query Language reference or stay
tuned for further blog posts.

Happy exploration!

No comments

Categories: en, Noise, Node, JavaScript, Rust


By Volker Mische

Powered by Kukkaisvoima version 7