
Re: [Xen-devel] Outreachy project - Xen Code Review Dashboard



On Sun, 2017-04-09 at 21:50 -0700, Heather Booker wrote:
> Hi Jesus,
> 
> While using the Elasticsearch python library
> (https://elasticsearch-py.readthedocs.io/en/master/) to add mbox
> messages to an index, I would get a UnicodeEncodeError:
> "'utf-8' codec can't encode character '\udca0' in position 767:
> surrogates not allowed".
> 

What happens here is that Perceval makes some assumptions about
character encoding when reading messages (to convert them to Unicode
strings). When a byte does not fit those assumptions, it is decoded as
a "surrogate" code point. Producing utf8 from those cannot be done,
since the "surrogate" range of Unicode only exists so that the string
can be converted back to its original encoding. But JSON expects the
encoding to be utf8, so no luck here.

The trick is to provide a serializer which either skips those messages,
or produces an "escaped" encoding for them.

See http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/ for
a detailed explanation.
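
As a rough illustration of the "escaped" option, here is a minimal
sketch using only the Python standard library. It is not what Perceval
or GrimoireELK actually do: the sample bytes, the item layout, and the
helper names (is_utf8_encodable, clean_text, clean_item) are all made
up for the example.

```python
import json

# Hypothetical message body: the byte 0xA0 is not valid UTF-8 on its own.
raw = b"Hello\xa0world"

# Decoding with the "surrogateescape" handler maps the stray byte to the
# lone surrogate U+DCA0, the character named in the error message.
text = raw.decode("utf-8", errors="surrogateescape")


def is_utf8_encodable(s):
    """Return True if s can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def clean_text(s, errors="backslashreplace"):
    """Return a copy of s that is safe to encode as UTF-8.

    errors="backslashreplace" keeps an escaped form such as "\\udca0",
    errors="replace" substitutes "?", errors="ignore" drops the character.
    """
    return s.encode("utf-8", errors=errors).decode("utf-8")


def clean_item(obj, errors="backslashreplace"):
    """Recursively clean every string in a JSON-like structure."""
    if isinstance(obj, str):
        return clean_text(obj, errors)
    if isinstance(obj, dict):
        return {
            (clean_text(k, errors) if isinstance(k, str) else k): clean_item(v, errors)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [clean_item(v, errors) for v in obj]
    return obj


# Made-up item shape, only for illustration.
item = {"data": {"Subject": "test", "body": {"plain": text}}}
cleaned = clean_item(item)

# The cleaned item can now be turned into UTF-8 encoded JSON.
payload = json.dumps(cleaned, ensure_ascii=False).encode("utf-8")
```

The cleaned dict can then be handed to whatever upload call you are
using. Passing errors="ignore" instead amounts to splicing the
offending character out of the message.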

Please, let me know if you can work from here...

        Jesus.

> Investigating in Grimoire elk
> https://github.com/grimoirelab/GrimoireELK/blob/96b00bc682485976104a6825ca63ae08639deacc/grimoire_elk/elk/mbox.py#L200
> seems to show that
> perhaps that tool instead uses Latin-1 encoding, but I found that
> to then produce a serialization error (their custom error message:
> "Unable to serialize %r (type: %s)"). I suppose this is because
> now it's bytes; of course, converting back to string after encoding
> just cycles back to the first error.
> 
> As somewhat of a Python newbie I don't really know how to tackle
> this! My thought atm is to splice the offending character out
> of the message. 
> 
> And to clarify, my understanding is that the final result of this
> task is an index of Xen data, with two types: commits and messages.
> Each commit document should contain its original information
> from git, plus the name of the branch it was developed in. And
> should only the mbox messages which appear to be associated
> with a specific commit exist in the final index? Is there some
> key information in messages that is supposed to indicate the
> association of a given commit with a git branch? I would be
> grateful if you could specify the end goal a little more. :D
> 
> Thanks so much!
> 
> Heather
> 
> 
> 
> On Sat, Apr 8, 2017 at 10:02 AM, Jesus M. Gonzalez-Barahona
> <jgb@bitergia.com> wrote:
> > On Fri, 2017-04-07 at 15:49 -0700, Heather Booker wrote:
> > > Hi Jesus, 
> > >
> > > Thanks for your reply!
> > >
> > > So about the task, instructions say after analyzing mboxes with
> > > Perceval to "store the resulting raw index in ElasticSearch" - what
> > > does raw index mean?
> > 
> > In this context, I mean "storing the JSON documents produced by
> > Perceval in an ElasticSearch index, as such". ElasticSearch stores
> > JSON documents, so it is just uploading the output of Perceval to it.
> > 
> > > In terms of figuring out the elasticsearch structure, do I want an
> > > index (xen-devel mbox) with a type (message) and each object from
> > > the perceval output to be one document? Or should it be more
> > > fine-grained?
> > 
> > Exactly.
> > 
> > Saludos,
> > 
> >         Jesus.
> > 
> > > Cheers,
> > >
> > > Heather
> > >
> > > On Thu, Apr 6, 2017 at 7:05 AM, Jesus M. Gonzalez-Barahona
> > > <jgb@bitergia.com> wrote:
> > > > On Wed, 2017-04-05 at 16:43 -0700, Heather Booker wrote:
> > > > > Hi!
> > > > >
> > > > > I'd love to work on the Code Review Dashboard project for this
> > > > > round of Outreachy.
> > > >
> > > > Great!!
> > > >
> > > > > Are the steps outlined here
> > > > > http://markmail.org/message/7adkmords3imkswd still the first
> > > > > contribution you'd like to see?
> > > >
> > > > Yes.
> > > >
> > > > > So is this a project that has been worked on in previous rounds
> > > > > of GSOC/Outreachy also?
> > > > > If so is there a place to find links to the previous
> > > > > participants' blogs? :)
> > > >
> > > > No. We had one participant at some point, who couldn't even start
> > > > for personal reasons. There are some people considering working on
> > > > this for this next round of Outreachy, however. You'll see their
> > > > messages in this mailing list.
> > > >
> > > > > Should questions about the specifications/completion of the
> > > > > microtask be addressed to IRC or this list? If IRC, which
> > > > > channel - #xen-opw or #metrics-grimoire? On that note, I'm
> > > > > curious why #metrics-grimoire is the listed channel on the
> > > > > project page - are main contributors involved in both projects?
> > > > > Or is it just because the Xen dashboard doesn't have a channel?
> > > >
> > > > The code review is for the Xen project, but it is done with (I
> > > > mean, the software used for it is) GrimoireLab, which for
> > > > historical reasons uses the #metrics-grimoire channel. That's why
> > > > you are likely to find somebody from the project there.
> > > >
> > > > If you have questions, and find me around in IRC, please ping me.
> > > > If I'm not available, please send an email message.
> > > >
> > > > Saludos,
> > > >
> > > >         Jesus.
> > > >
> > > > > Thanks!
> > > > >
> > > > > Heather
> > > > > _______________________________________________
> > > > > Xen-devel mailing list
> > > > > Xen-devel@xxxxxxxxxxxxx
> > > > > https://lists.xen.org/xen-devel
> > > > --
> > > > Bitergia: http://bitergia.com
> > > > /me at Twitter: https://twitter.com/jgbarah
> > > >
> > > >
> > >
-- 
Bitergia: http://bitergia.com
/me at Twitter: https://twitter.com/jgbarah



 

